AVLnet: Learning Audio-Visual Language Representations from Instructional Videos

16 June 2020
Andrew Rouditchenko
Angie Boggust
David Harwath
Brian Chen
D. Joshi
Samuel Thomas
Kartik Audhkhasi
Hilde Kuehne
Yikang Shen
Rogerio Feris
Brian Kingsbury
M. Picheny
Antonio Torralba
James R. Glass
    SSL

Papers citing "AVLnet: Learning Audio-Visual Language Representations from Instructional Videos"

50 / 111 papers shown
Leveraging Auxiliary Information in Text-to-Video Retrieval: A Review
A. Fragomeni
Dima Damen
Michael Wray
268
0
0
29 May 2025
CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment
Computer Vision and Pattern Recognition (CVPR), 2025
Edson Araujo
Andrew Rouditchenko
Yuan Gong
Saurabhchand Bhati
Samuel Thomas
Brian Kingsbury
Leonid Karlinsky
Rogerio Feris
James Glass
Hilde Kuehne
474
6
0
02 May 2025
A Review on Large Language Models for Visual Analytics
Navya Sonal Agarwal
Sanjay Kumar Sonbhadra
414
8
0
19 Mar 2025
Prompt2LVideos: Exploring Prompts for Understanding Long-Form Multimodal Videos
Soumya Jahagirdar
Jayasree Saha
C. V. Jawahar
405
0
0
11 Mar 2025
Enhancing Explainability with Multimodal Context Representations for Smarter Robots
Anargh Viswanath
Lokesh Veeramacheneni
Hendrik Buschmeier
195
1
0
28 Feb 2025
A Survey of Recent Advances and Challenges in Deep Audio-Visual Correlation Learning
ACM Computing Surveys (ACM CSUR), 2024
Luis Vilaca
Yi Yu
Paula Vinan
539
3
0
24 Nov 2024
Contrasting with Symile: Simple Model-Agnostic Representation Learning for Unlimited Modalities
Neural Information Processing Systems (NeurIPS), 2024
A. Saporta
N. Jethani
Mark Goldstein
Rajesh Ranganath
SSL
298
13
0
01 Nov 2024
You Only Speak Once to See
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024
Wenhao Yang
Jianguo Wei
Wenhuan Lu
Lei Li
VOS
330
6
0
27 Sep 2024
Multimodal Emotion Recognition using Audio-Video Transformer Fusion with Cross Attention
Joe Dhanith
Shravan Venkatraman
Modigari Narendra
Vigya Sharma
499
11
0
26 Jul 2024
Translating speech with just images
Dan Oneaţă
Herman Kamper
VLM
253
1
0
11 Jun 2024
VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling
Zeyue Tian
Zhaoyang Liu
Ruibin Yuan
Jiahao Pan
Xiaoqiang Huang
Xu Tan
Qifeng Chen
VGen
705
35
0
06 Jun 2024
AVFF: Audio-Visual Feature Fusion for Video Deepfake Detection
Trevine Oorloff
Surya Koppisetti
Nicolo Bonettini
Divyaraj Solanki
Ben Colman
Yaser Yacoob
Ali Shahriyari
Gaurav Bharaj
423
94
0
05 Jun 2024
Coupled Mamba: Enhanced Multi-modal Fusion with Coupled State Space Model
Wenbing Li
Hang Zhou
Junqing Yu
Zikai Song
Wei Yang
Mamba
326
38
0
28 May 2024
CLIP-Powered TASS: Target-Aware Single-Stream Network for Audio-Visual Question Answering
Yuanyuan Jiang
Jianqin Yin
360
5
0
13 May 2024
Unified Video-Language Pre-training with Synchronized Audio
Shentong Mo
Haofan Wang
Huaxia Li
Xu Tang
299
2
0
12 May 2024
Unifying Latent and Lexicon Representations for Effective Video-Text Retrieval
Haowei Liu
Yaya Shi
Haiyang Xu
Chunfen Yuan
Qinghao Ye
...
Mingshi Yan
Ji Zhang
Fei Huang
Bing Li
Weiming Hu
253
1
0
26 Feb 2024
Event-aware Video Corpus Moment Retrieval
Danyang Hou
Liang Pang
Huawei Shen
Xueqi Cheng
355
4
0
21 Feb 2024
Comment-aided Video-Language Alignment via Contrastive Pre-training for Short-form Video Humor Detection
Yang Liu
Tongfei Shen
Dong Zhang
Qingying Sun
Shoushan Li
Guodong Zhou
297
5
0
14 Feb 2024
FunnyNet-W: Multimodal Learning of Funny Moments in Videos in the Wild
International Journal of Computer Vision (IJCV), 2024
Zhi-Song Liu
Robin Courant
Vicky Kalogeiton
404
11
0
08 Jan 2024
Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities
Computer Vision and Pattern Recognition (CVPR), 2023
A. Piergiovanni
Isaac Noble
Dahun Kim
Michael S. Ryoo
Victor Gomes
A. Angelova
474
26
0
09 Nov 2023
HowToCaption: Prompting LLMs to Transform Video Annotations at Scale
European Conference on Computer Vision (ECCV), 2023
Nina Shvetsova
Anna Kukleva
Xudong Hong
Christian Rupprecht
Bernt Schiele
Hilde Kuehne
376
33
0
07 Oct 2023
Video-adverb retrieval with compositional adverb-action embeddings
British Machine Vision Conference (BMVC), 2023
Thomas Hummel
Otniel-Bogdan Mercea
A. Sophia Koepke
Zeynep Akata
230
1
0
26 Sep 2023
TMac: Temporal Multi-Modal Graph Learning for Acoustic Event Classification
ACM Multimedia (ACM MM), 2023
Meng Liu
K. Liang
Dayu Hu
Hao Yu
Yue Liu
Lingyuan Meng
Wenxuan Tu
Sihang Zhou
Xinwang Liu
344
40
0
21 Sep 2023
Zero-shot Audio Topic Reranking using Large Language Models
Spoken Language Technology Workshop (SLT), 2023
Mengjie Qian
Rao Ma
Adian Liusie
Erfan Loweimi
Kate Knill
Mark Gales
246
1
0
14 Sep 2023
Preserving Modality Structure Improves Multi-Modal Learning
IEEE International Conference on Computer Vision (ICCV), 2023
Swetha Sirnam
Mamshad Nayeem Rizve
Nina Shvetsova
Hilde Kuehne
M. Shah
287
14
0
24 Aug 2023
CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models
IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2023
Hao-Wen Dong
Xiaoyu Liu
Jordi Pons
Gautam Bhattacharya
Santiago Pascual
Joan Serrà
Taylor Berg-Kirkpatrick
Julian McAuley
DiffM
341
28
0
16 Jun 2023
Language-Guided Music Recommendation for Video via Prompt Analogies
Computer Vision and Pattern Recognition (CVPR), 2023
Daniel McKee
Justin Salamon
Josef Sivic
Bryan C. Russell
VGen
314
33
0
15 Jun 2023
Learning to Ground Instructional Articles in Videos through Narrations
IEEE International Conference on Computer Vision (ICCV), 2023
E. Mavroudi
Triantafyllos Afouras
Lorenzo Torresani
DiffM
303
27
0
06 Jun 2023
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
Neural Information Processing Systems (NeurIPS), 2023
Sihan Chen
Handong Li
Qunbo Wang
Zijia Zhao
Ming-Ting Sun
Xinxin Zhu
Qingbin Liu
585
202
0
29 May 2023
LANISTR: Multimodal Learning from Structured and Unstructured Data
Sayna Ebrahimi
Sercan O. Arik
Yihe Dong
Tomas Pfister
354
10
0
26 May 2023
Target-Aware Spatio-Temporal Reasoning via Answering Questions in Dynamics Audio-Visual Scenarios
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Yuanyuan Jiang
Jianqin Yin
239
9
0
21 May 2023
Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model
Interspeech (Interspeech), 2023
Puyuan Peng
Shang-Wen Li
Okko Räsänen
Abdel-rahman Mohamed
David Harwath
SSL
VLM
330
11
0
19 May 2023
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Sihan Chen
Xingjian He
Longteng Guo
Xinxin Zhu
Weining Wang
Jinhui Tang
VLM
525
173
0
17 Apr 2023
Instance-Level Trojan Attacks on Visual Question Answering via Adversarial Learning in Neuron Activation Space
IEEE International Joint Conference on Neural Network (IJCNN), 2023
Yuwei Sun
H. Ochiai
Jun Sakuma
AAML
350
6
0
02 Apr 2023
Hindi as a Second Language: Improving Visually Grounded Speech with Semantically Similar Samples
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023
H. Ryu
Arda Senocak
In So Kweon
Joon Son Chung
VLM
352
12
0
30 Mar 2023
What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions
Computer Vision and Pattern Recognition (CVPR), 2023
Brian Chen
Nina Shvetsova
Andrew Rouditchenko
D. Kondermann
Samuel Thomas
Shih-Fu Chang
Rogerio Feris
James R. Glass
Hilde Kuehne
400
11
0
29 Mar 2023
Language-Guided Audio-Visual Source Separation via Trimodal Consistency
Computer Vision and Pattern Recognition (CVPR), 2023
Reuben Tan
Arijit Ray
Andrea Burns
Bryan A. Plummer
Justin Salamon
Oriol Nieto
Bryan C. Russell
Kate Saenko
280
31
0
28 Mar 2023
Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding
International Conference on Learning Representations (ICLR), 2023
Yuanhao Xiong
Long Zhao
Boqing Gong
Ming-Hsuan Yang
Florian Schroff
Ting Liu
Cho-Jui Hsieh
Liangzhe Yuan
VLM
350
0
0
28 Mar 2023
Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos
Teng Wang
Jinrui Zhang
Feng Zheng
Wenhao Jiang
Ran Cheng
Ping Luo
VLM
318
15
0
11 Mar 2023
What You Say Is What You Show: Visual Narration Detection in Instructional Videos
Kumar Ashutosh
Rohit Girdhar
Lorenzo Torresani
Kristen Grauman
447
5
0
05 Jan 2023
Multi-queue Momentum Contrast for Microvideo-Product Retrieval
Web Search and Data Mining (WSDM), 2022
Yali Du
Yin-wei Wei
Wei Ji
Fan Liu
Xin Luo
Liqiang Nie
220
20
0
22 Dec 2022
MAViL: Masked Audio-Video Learners
Neural Information Processing Systems (NeurIPS), 2022
Po-Yao (Bernie) Huang
Vasu Sharma
Hu Xu
Chaitanya K. Ryali
Haoqi Fan
Yanghao Li
Shang-Wen Li
Gargi Ghosh
Jitendra Malik
Christoph Feichtenhofer
465
82
0
15 Dec 2022
SimVTP: Simple Video Text Pre-training with Masked Autoencoders
Yue Ma
Tianyu Yang
Yin Shan
Xiu Li
209
30
0
07 Dec 2022
Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2022
Zineng Tang
Jaemin Cho
Jie Lei
Joey Tianyi Zhou
VLM
227
10
0
21 Nov 2022
SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-training
IEEE International Conference on Computer Vision (ICCV), 2022
Yuanze Lin
Chen Wei
Huiyu Wang
Alan Yuille
Cihang Xie
3DGS
373
17
0
21 Nov 2022
Cross-Modal Adapter for Vision-Language Retrieval
Pattern Recognition (Pattern Recogn.), 2022
Haojun Jiang
Jianke Zhang
Rui Huang
Chunjiang Ge
Zanlin Ni
Jiwen Lu
Gao Huang
461
43
0
17 Nov 2022
Scaling Multimodal Pre-Training via Cross-Modality Gradient Harmonization
Neural Information Processing Systems (NeurIPS), 2022
Junru Wu
Yi Liang
Feng Han
Hassan Akbari
Zinan Lin
Cong Yu
215
15
0
03 Nov 2022
Unsupervised Audio-Visual Lecture Segmentation
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2022
Darshan Singh
Anchit Gupta
C. V. Jawahar
Makarand Tapaswi
VOS
298
9
0
29 Oct 2022
Learning Joint Representation of Human Motion and Language
Jihoon Kim
Youngjae Yu
Seungyoung Shin
Taehyun Byun
Sungjoon Choi
225
5
0
27 Oct 2022
Efficient Cross-Modal Video Retrieval with Meta-Optimized Frames
IEEE Transactions on Multimedia (IEEE TMM), 2022
Ning Han
Xun Yang
Ee-Peng Lim
Hao Chen
Qianru Sun
271
9
0
16 Oct 2022