2006.09199
Cited By
v1
v2 (latest)
AVLnet: Learning Audio-Visual Language Representations from Instructional Videos
16 June 2020
Andrew Rouditchenko
Angie Boggust
David Harwath
Brian Chen
D. Joshi
Samuel Thomas
Kartik Audhkhasi
Hilde Kuehne
Yikang Shen
Rogerio Feris
Brian Kingsbury
M. Picheny
Antonio Torralba
James R. Glass
SSL
Papers citing
"AVLnet: Learning Audio-Visual Language Representations from Instructional Videos"
50 / 111 papers shown
Leveraging Auxiliary Information in Text-to-Video Retrieval: A Review
A. Fragomeni
Dima Damen
Michael Wray
268
0
0
29 May 2025
CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment
Computer Vision and Pattern Recognition (CVPR), 2025
Edson Araujo
Andrew Rouditchenko
Yuan Gong
Saurabhchand Bhati
Samuel Thomas
Brian Kingsbury
Leonid Karlinsky
Rogerio Feris
James Glass
Hilde Kuehne
474
6
0
02 May 2025
A Review on Large Language Models for Visual Analytics
Navya Sonal Agarwal
Sanjay Kumar Sonbhadra
414
8
0
19 Mar 2025
Prompt2LVideos: Exploring Prompts for Understanding Long-Form Multimodal Videos
Soumya Jahagirdar
Jayasree Saha
C. V. Jawahar
405
0
0
11 Mar 2025
Enhancing Explainability with Multimodal Context Representations for Smarter Robots
Anargh Viswanath
Lokesh Veeramacheneni
Hendrik Buschmeier
195
1
0
28 Feb 2025
A Survey of Recent Advances and Challenges in Deep Audio-Visual Correlation Learning
ACM Computing Surveys (ACM CSUR), 2024
Luis Vilaca
Yi Yu
Paula Vinan
539
3
0
24 Nov 2024
Contrasting with Symile: Simple Model-Agnostic Representation Learning for Unlimited Modalities
Neural Information Processing Systems (NeurIPS), 2024
A. Saporta
N. Jethani
Mark Goldstein
Rajesh Ranganath
SSL
298
13
0
01 Nov 2024
You Only Speak Once to See
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024
Wenhao Yang
Jianguo Wei
Wenhuan Lu
Lei Li
VOS
330
6
0
27 Sep 2024
Multimodal Emotion Recognition using Audio-Video Transformer Fusion with Cross Attention
Joe Dhanith
Shravan Venkatraman
Modigari Narendra
Vigya Sharma
499
11
0
26 Jul 2024
Translating speech with just images
Dan Oneaţă
Herman Kamper
VLM
253
1
0
11 Jun 2024
VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling
Zeyue Tian
Zhaoyang Liu
Ruibin Yuan
Jiahao Pan
Xiaoqiang Huang
Xu Tan
Qifeng Chen
VGen
705
35
0
06 Jun 2024
AVFF: Audio-Visual Feature Fusion for Video Deepfake Detection
Trevine Oorloff
Surya Koppisetti
Nicolo Bonettini
Divyaraj Solanki
Ben Colman
Yaser Yacoob
Ali Shahriyari
Gaurav Bharaj
423
94
0
05 Jun 2024
Coupled Mamba: Enhanced Multi-modal Fusion with Coupled State Space Model
Wenbing Li
Hang Zhou
Junqing Yu
Zikai Song
Wei Yang
Mamba
326
38
0
28 May 2024
CLIP-Powered TASS: Target-Aware Single-Stream Network for Audio-Visual Question Answering
Yuanyuan Jiang
Jianqin Yin
360
5
0
13 May 2024
Unified Video-Language Pre-training with Synchronized Audio
Shentong Mo
Haofan Wang
Huaxia Li
Xu Tang
299
2
0
12 May 2024
Unifying Latent and Lexicon Representations for Effective Video-Text Retrieval
Haowei Liu
Yaya Shi
Haiyang Xu
Chunfen Yuan
Qinghao Ye
...
Mingshi Yan
Ji Zhang
Fei Huang
Bing Li
Weiming Hu
253
1
0
26 Feb 2024
Event-aware Video Corpus Moment Retrieval
Danyang Hou
Liang Pang
Huawei Shen
Xueqi Cheng
355
4
0
21 Feb 2024
Comment-aided Video-Language Alignment via Contrastive Pre-training for Short-form Video Humor Detection
Yang Liu
Tongfei Shen
Dong Zhang
Qingying Sun
Shoushan Li
Guodong Zhou
297
5
0
14 Feb 2024
FunnyNet-W: Multimodal Learning of Funny Moments in Videos in the Wild
International Journal of Computer Vision (IJCV), 2024
Zhi-Song Liu
Robin Courant
Vicky Kalogeiton
404
11
0
08 Jan 2024
Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities
Computer Vision and Pattern Recognition (CVPR), 2023
A. Piergiovanni
Isaac Noble
Dahun Kim
Michael S. Ryoo
Victor Gomes
A. Angelova
474
26
0
09 Nov 2023
HowToCaption: Prompting LLMs to Transform Video Annotations at Scale
European Conference on Computer Vision (ECCV), 2023
Nina Shvetsova
Anna Kukleva
Xudong Hong
Christian Rupprecht
Bernt Schiele
Hilde Kuehne
376
33
0
07 Oct 2023
Video-adverb retrieval with compositional adverb-action embeddings
British Machine Vision Conference (BMVC), 2023
Thomas Hummel
Otniel-Bogdan Mercea
A. Sophia Koepke
Zeynep Akata
230
1
0
26 Sep 2023
TMac: Temporal Multi-Modal Graph Learning for Acoustic Event Classification
ACM Multimedia (ACM MM), 2023
Meng Liu
K. Liang
Dayu Hu
Hao Yu
Yue Liu
Lingyuan Meng
Wenxuan Tu
Sihang Zhou
Xinwang Liu
344
40
0
21 Sep 2023
Zero-shot Audio Topic Reranking using Large Language Models
Spoken Language Technology Workshop (SLT), 2023
Mengjie Qian
Rao Ma
Adian Liusie
Erfan Loweimi
Kate Knill
Mark Gales
246
1
0
14 Sep 2023
Preserving Modality Structure Improves Multi-Modal Learning
IEEE International Conference on Computer Vision (ICCV), 2023
Swetha Sirnam
Mamshad Nayeem Rizve
Nina Shvetsova
Hilde Kuehne
M. Shah
287
14
0
24 Aug 2023
CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models
IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2023
Hao-Wen Dong
Xiaoyu Liu
Jordi Pons
Gautam Bhattacharya
Santiago Pascual
Joan Serrà
Taylor Berg-Kirkpatrick
Julian McAuley
DiffM
341
28
0
16 Jun 2023
Language-Guided Music Recommendation for Video via Prompt Analogies
Computer Vision and Pattern Recognition (CVPR), 2023
Daniel McKee
Justin Salamon
Josef Sivic
Bryan C. Russell
VGen
314
33
0
15 Jun 2023
Learning to Ground Instructional Articles in Videos through Narrations
IEEE International Conference on Computer Vision (ICCV), 2023
E. Mavroudi
Triantafyllos Afouras
Lorenzo Torresani
DiffM
303
27
0
06 Jun 2023
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
Neural Information Processing Systems (NeurIPS), 2023
Sihan Chen
Handong Li
Qunbo Wang
Zijia Zhao
Ming-Ting Sun
Xinxin Zhu
Qingbin Liu
585
202
0
29 May 2023
LANISTR: Multimodal Learning from Structured and Unstructured Data
Sayna Ebrahimi
Sercan O. Arik
Yihe Dong
Tomas Pfister
354
10
0
26 May 2023
Target-Aware Spatio-Temporal Reasoning via Answering Questions in Dynamics Audio-Visual Scenarios
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Yuanyuan Jiang
Jianqin Yin
239
9
0
21 May 2023
Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model
Interspeech (Interspeech), 2023
Puyuan Peng
Shang-Wen Li
Okko Räsänen
Abdel-rahman Mohamed
David Harwath
SSL
VLM
330
11
0
19 May 2023
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Sihan Chen
Xingjian He
Longteng Guo
Xinxin Zhu
Weining Wang
Jinhui Tang
VLM
525
173
0
17 Apr 2023
Instance-Level Trojan Attacks on Visual Question Answering via Adversarial Learning in Neuron Activation Space
IEEE International Joint Conference on Neural Network (IJCNN), 2023
Yuwei Sun
H. Ochiai
Jun Sakuma
AAML
350
6
0
02 Apr 2023
Hindi as a Second Language: Improving Visually Grounded Speech with Semantically Similar Samples
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023
H. Ryu
Arda Senocak
In So Kweon
Joon Son Chung
VLM
352
12
0
30 Mar 2023
What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions
Computer Vision and Pattern Recognition (CVPR), 2023
Brian Chen
Nina Shvetsova
Andrew Rouditchenko
D. Kondermann
Samuel Thomas
Shih-Fu Chang
Rogerio Feris
James R. Glass
Hilde Kuehne
400
11
0
29 Mar 2023
Language-Guided Audio-Visual Source Separation via Trimodal Consistency
Computer Vision and Pattern Recognition (CVPR), 2023
Reuben Tan
Arijit Ray
Andrea Burns
Bryan A. Plummer
Justin Salamon
Oriol Nieto
Bryan C. Russell
Kate Saenko
280
31
0
28 Mar 2023
Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding
International Conference on Learning Representations (ICLR), 2023
Yuanhao Xiong
Long Zhao
Boqing Gong
Ming-Hsuan Yang
Florian Schroff
Ting Liu
Cho-Jui Hsieh
Liangzhe Yuan
VLM
350
0
0
28 Mar 2023
Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos
Teng Wang
Jinrui Zhang
Feng Zheng
Wenhao Jiang
Ran Cheng
Ping Luo
VLM
318
15
0
11 Mar 2023
What You Say Is What You Show: Visual Narration Detection in Instructional Videos
Kumar Ashutosh
Rohit Girdhar
Lorenzo Torresani
Kristen Grauman
447
5
0
05 Jan 2023
Multi-queue Momentum Contrast for Microvideo-Product Retrieval
Web Search and Data Mining (WSDM), 2022
Yali Du
Yin-wei Wei
Wei Ji
Fan Liu
Xin Luo
Liqiang Nie
220
20
0
22 Dec 2022
MAViL: Masked Audio-Video Learners
Neural Information Processing Systems (NeurIPS), 2022
Po-Yao (Bernie) Huang
Vasu Sharma
Hu Xu
Chaitanya K. Ryali
Haoqi Fan
Yanghao Li
Shang-Wen Li
Gargi Ghosh
Jitendra Malik
Christoph Feichtenhofer
465
82
0
15 Dec 2022
SimVTP: Simple Video Text Pre-training with Masked Autoencoders
Yue Ma
Tianyu Yang
Yin Shan
Xiu Li
209
30
0
07 Dec 2022
Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2022
Zineng Tang
Jaemin Cho
Jie Lei
Joey Tianyi Zhou
VLM
227
10
0
21 Nov 2022
SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-training
IEEE International Conference on Computer Vision (ICCV), 2022
Yuanze Lin
Chen Wei
Huiyu Wang
Alan Yuille
Cihang Xie
3DGS
373
17
0
21 Nov 2022
Cross-Modal Adapter for Vision-Language Retrieval
Pattern Recognition (Pattern Recogn.), 2022
Haojun Jiang
Jianke Zhang
Rui Huang
Chunjiang Ge
Zanlin Ni
Jiwen Lu
Gao Huang
461
43
0
17 Nov 2022
Scaling Multimodal Pre-Training via Cross-Modality Gradient Harmonization
Neural Information Processing Systems (NeurIPS), 2022
Junru Wu
Yi Liang
Feng Han
Hassan Akbari
Zinan Lin
Cong Yu
215
15
0
03 Nov 2022
Unsupervised Audio-Visual Lecture Segmentation
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2022
Darshan Singh
Anchit Gupta
C. V. Jawahar
Makarand Tapaswi
VOS
298
9
0
29 Oct 2022
Learning Joint Representation of Human Motion and Language
Jihoon Kim
Youngjae Yu
Seungyoung Shin
Taehyun Byun
Sungjoon Choi
225
5
0
27 Oct 2022
Efficient Cross-Modal Video Retrieval with Meta-Optimized Frames
IEEE transactions on multimedia (IEEE TMM), 2022
Ning Han
Xun Yang
Ee-Peng Lim
Hao Chen
Qianru Sun
271
9
0
16 Oct 2022