AVLnet: Learning Audio-Visual Language Representations from Instructional Videos

16 June 2020

Kartik Audhkhasi

Antonio Torralba

Papers citing "AVLnet: Learning Audio-Visual Language Representations from Instructional Videos"

50 / 111 papers shown

Leveraging Auxiliary Information in Text-to-Video Retrieval: A Review

A. Fragomeni

Dima Damen

Michael Wray

268

29 May 2025

CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment

Computer Vision and Pattern Recognition (CVPR), 2025

474

02 May 2025

A Review on Large Language Models for Visual Analytics

Navya Sonal Agarwal

Sanjay Kumar Sonbhadra

414

19 Mar 2025

Prompt2LVideos: Exploring Prompts for Understanding Long-Form Multimodal Videos

Soumya Jahagirdar

Jayasree Saha

C. V. Jawahar

405

11 Mar 2025

Enhancing Explainability with Multimodal Context Representations for Smarter Robots

Anargh Viswanath

Lokesh Veeramacheneni

Hendrik Buschmeier

195

28 Feb 2025

A Survey of Recent Advances and Challenges in Deep Audio-Visual Correlation Learning

ACM Computing Surveys (ACM CSUR), 2024

Luis Vilaca

Yi Yu

Paula Viana

539

24 Nov 2024

Contrasting with Symile: Simple Model-Agnostic Representation Learning for Unlimited Modalities

Neural Information Processing Systems (NeurIPS), 2024

A. Saporta

N. Jethani

Mark Goldstein

Rajesh Ranganath

SSL

298

01 Nov 2024

You Only Speak Once to See

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024

Wenhao Yang

Jianguo Wei

Wenhuan Lu

Lei Li

VOS

330

27 Sep 2024

Multimodal Emotion Recognition using Audio-Video Transformer Fusion with Cross Attention

499

26 Jul 2024

Translating speech with just images

Dan Oneaţă

Herman Kamper

VLM

253

11 Jun 2024

VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling

705

06 Jun 2024

AVFF: Audio-Visual Feature Fusion for Video Deepfake Detection

423

05 Jun 2024

Coupled Mamba: Enhanced Multi-modal Fusion with Coupled State Space Model

326

28 May 2024

CLIP-Powered TASS: Target-Aware Single-Stream Network for Audio-Visual Question Answering

Yuanyuan Jiang

Jianqin Yin

360

13 May 2024

Unified Video-Language Pre-training with Synchronized Audio

Shentong Mo

Haofan Wang

Huaxia Li

Xu Tang

299

12 May 2024

Unifying Latent and Lexicon Representations for Effective Video-Text Retrieval

...

Ji Zhang

Fei Huang

Bing Li

Weiming Hu

253

26 Feb 2024

Event-aware Video Corpus Moment Retrieval

355

21 Feb 2024

Comment-aided Video-Language Alignment via Contrastive Pre-training for Short-form Video Humor Detection

297

14 Feb 2024

FunnyNet-W: Multimodal Learning of Funny Moments in Videos in the Wild

International Journal of Computer Vision (IJCV), 2024

Zhi-Song Liu

Robin Courant

Vicky Kalogeiton

404

08 Jan 2024

Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities

Computer Vision and Pattern Recognition (CVPR), 2023

474

09 Nov 2023

HowToCaption: Prompting LLMs to Transform Video Annotations at Scale

European Conference on Computer Vision (ECCV), 2023

Nina Shvetsova

Anna Kukleva

Xudong Hong

Christian Rupprecht

Bernt Schiele

Hilde Kuehne

376

07 Oct 2023

Video-adverb retrieval with compositional adverb-action embeddings

British Machine Vision Conference (BMVC), 2023

Thomas Hummel

Otniel-Bogdan Mercea

A. Sophia Koepke

Zeynep Akata

230

26 Sep 2023

TMac: Temporal Multi-Modal Graph Learning for Acoustic Event Classification

ACM Multimedia (ACM MM), 2023

344

21 Sep 2023

Zero-shot Audio Topic Reranking using Large Language Models

Spoken Language Technology Workshop (SLT), 2023

Rao Ma

246

14 Sep 2023

Preserving Modality Structure Improves Multi-Modal Learning

IEEE International Conference on Computer Vision (ICCV), 2023

287

24 Aug 2023

CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models

IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2023

Xiaoyu Liu

Taylor Berg-Kirkpatrick

Julian McAuley

DiffM

341

16 Jun 2023

Language-Guided Music Recommendation for Video via Prompt Analogies

Computer Vision and Pattern Recognition (CVPR), 2023

314

15 Jun 2023

Learning to Ground Instructional Articles in Videos through Narrations

IEEE International Conference on Computer Vision (ICCV), 2023

E. Mavroudi

Triantafyllos Afouras

Lorenzo Torresani

DiffM

303

06 Jun 2023

VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset

Neural Information Processing Systems (NeurIPS), 2023

585

202

29 May 2023

LANISTR: Multimodal Learning from Structured and Unstructured Data

Sayna Ebrahimi

Sercan O. Arik

Yihe Dong

Tomas Pfister

354

26 May 2023

Target-Aware Spatio-Temporal Reasoning via Answering Questions in Dynamics Audio-Visual Scenarios

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

Yuanyuan Jiang

Jianqin Yin

239

21 May 2023

Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model

Interspeech (Interspeech), 2023

330

19 May 2023

VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023

525

173

17 Apr 2023

Instance-Level Trojan Attacks on Visual Question Answering via Adversarial Learning in Neuron Activation Space

IEEE International Joint Conference on Neural Network (IJCNN), 2023

350

02 Apr 2023

Hindi as a Second Language: Improving Visually Grounded Speech with Semantically Similar Samples

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023

H. Ryu

Arda Senocak

In So Kweon

Joon Son Chung

VLM

352

30 Mar 2023

What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions

Computer Vision and Pattern Recognition (CVPR), 2023

400

29 Mar 2023

Language-Guided Audio-Visual Source Separation via Trimodal Consistency

Computer Vision and Pattern Recognition (CVPR), 2023

280

28 Mar 2023

Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding

International Conference on Learning Representations (ICLR), 2023

Ming-Hsuan Yang

350

28 Mar 2023

Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos

Ran Cheng

Ping Luo

VLM

318

11 Mar 2023

What You Say Is What You Show: Visual Narration Detection in Instructional Videos

447

05 Jan 2023

Multi-queue Momentum Contrast for Microvideo-Product Retrieval

Web Search and Data Mining (WSDM), 2022

Wei Ji

220

22 Dec 2022

MAViL: Masked Audio-Video Learners

Neural Information Processing Systems (NeurIPS), 2022

Po-Yao (Bernie) Huang

Christoph Feichtenhofer

465

15 Dec 2022

SimVTP: Simple Video Text Pre-training with Masked Autoencoders

Yue Ma

Tianyu Yang

Yin Shan

Xiu Li

209

07 Dec 2022

Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention

IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2022

227

21 Nov 2022

SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-training

IEEE International Conference on Computer Vision (ICCV), 2022

Cihang Xie

373

21 Nov 2022

Cross-Modal Adapter for Vision-Language Retrieval

Pattern Recognition (Pattern Recogn.), 2022

461

17 Nov 2022

Scaling Multimodal Pre-Training via Cross-Modality Gradient Harmonization

Neural Information Processing Systems (NeurIPS), 2022

215

03 Nov 2022

Unsupervised Audio-Visual Lecture Segmentation

IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2022

298

29 Oct 2022

Learning Joint Representation of Human Motion and Language

225

27 Oct 2022

Efficient Cross-Modal Video Retrieval with Meta-Optimized Frames

IEEE Transactions on Multimedia (IEEE TMM), 2022

Hao Chen

271

16 Oct 2022