MMMORRF: Multimodal Multilingual Modularized Reciprocal Rank Fusion

Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2025

Saron Samuel

Dan DeGenaro

Jimena Guallar-Blasco

Kate Sanders

...

474

26 Mar 2025

Can Text-to-Video Generation help Video-Language Alignment?

Computer Vision and Pattern Recognition (CVPR), 2025

327

24 Mar 2025

EgoDTM: Towards 3D-Aware Egocentric Video-Language Pretraining

546

19 Mar 2025

Continual Multimodal Contrastive Learning

721

19 Mar 2025

Adapting to the Unknown: Training-Free Audio-Visual Event Perception with Dynamic Thresholds

Computer Vision and Pattern Recognition (CVPR), 2025

355

17 Mar 2025

Language-guided Open-world Video Anomaly Detection under Weak Supervision

245

17 Mar 2025

TikZero: Zero-Shot Text-Guided Graphics Program Synthesis

Simone Paolo Ponzetto

625

14 Mar 2025

UVE: Are MLLMs Unified Evaluators for AI-Generated Videos?

852

13 Mar 2025

Quality Over Quantity? LLM-Based Curation for a Data-Efficient Audio-Video Foundation Model

Ali Vosoughi

Dimitra Emmanouilidou

H. Gamper

470

12 Mar 2025

Memory-enhanced Retrieval Augmentation for Long Video Understanding

346

12 Mar 2025

Exo2Ego: Exocentric Knowledge Guided MLLM for Egocentric Video Understanding

519

12 Mar 2025

Continual Learning for Multiple Modalities

Hyundong Jin

Eunwoo Kim

CLL

457

11 Mar 2025

Unveiling the Potential of Segment Anything Model 2 for RGB-Thermal Semantic Segmentation with Language Guidance

362

04 Mar 2025

SafeAuto: Knowledge-Enhanced Safe Autonomous Driving with Multimodal Foundation Models

451

28 Feb 2025

Can Hallucination Correction Improve Video-Language Alignment?

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

Lingjun Zhao

Mingyang Xie

Paola Cascante-Bonilla

Hal Daumé III

Kwonjoon Lee

HILM VLM

332

20 Feb 2025

Sce2DriveX: A Generalized MLLM Framework for Scene-to-Drive Learning

IEEE Robotics and Automation Letters (IEEE RA-L), 2025

299

19 Feb 2025

Language Models Can See Better: Visual Contrastive Decoding For LLM Multimodal Reasoning

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2025

244

17 Feb 2025

Uni-Retrieval: A Multi-Style Retrieval Framework for STEM's Education

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

555

09 Feb 2025

TEOChat: A Large Vision-Language Assistant for Temporal Earth Observation Data

International Conference on Learning Representations (ICLR), 2024

437

28 Jan 2025

The "Law" of the Unconscious Contrastive Learner: Probabilistic Alignment of Unpaired Modalities

International Conference on Learning Representations (ICLR), 2025

Yongwei Che

Benjamin Eysenbach

315

20 Jan 2025

Audio-Language Datasets of Scenes and Events: A Survey

IEEE Access (IEEE Access), 2024

465

10 Jan 2025

Hierarchical Banzhaf Interaction for General Video-Language Representation Learning

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024

395

31 Dec 2024

MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis

Computer Vision and Pattern Recognition (CVPR), 2024

544

19 Dec 2024

Do Language Models Understand Time?

The Web Conference (WWW), 2024

Xi Ding

Lei Wang

948

18 Dec 2024

Gramian Multimodal Representation Learning and Alignment

International Conference on Learning Representations (ICLR), 2024

466

16 Dec 2024

Expanding Event Modality Applications through a Robust CLIP-Based Encoder

466

04 Dec 2024

AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning

426

04 Dec 2024

OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows

Computer Vision and Pattern Recognition (CVPR), 2024

Shufan Li

Konstantinos Kallidromitis

458

02 Dec 2024

VideoSAVi: Self-Aligned Video Language Models without Human Supervision

Yogesh Kulkarni

Pooyan Fazli

VLM

608

01 Dec 2024

WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model

Computer Vision and Pattern Recognition (CVPR), 2024

522

26 Nov 2024

ReWind: Understanding Long Videos with Instructed Learnable Memory

Computer Vision and Pattern Recognition (CVPR), 2024

377

23 Nov 2024

Tiny-Align: Bridging Automatic Speech Recognition and Large Language Model on the Edge

512

21 Nov 2024

Generative Emotion Cause Explanation in Multimodal Conversations

International Conference on Multimedia Retrieval (ICMR), 2024

468

01 Nov 2024

MultiVENT 2.0: A Massive Multilingual Benchmark for Event-Centric Video Retrieval

Computer Vision and Pattern Recognition (CVPR), 2024

...

479

15 Oct 2024

Deep Correlated Prompting for Visual Recognition with Missing Modalities

Neural Information Processing Systems (NeurIPS), 2024

Wei Feng

462

09 Oct 2024

Human-in-the-loop Reasoning For Traffic Sign Detection: Collaborative Approach Yolo With Video-llava

Mehdi Azarafza

Fatima Idrees

Ali Ehteshami Bejnordi

Charles Steinmetz

Stefan Henkler

A. Rettberg

333

07 Oct 2024

Geometric Analysis of Reasoning Trajectories: A Phase Space Approach to Understanding Valid and Invalid Multi-Hop Reasoning in LLMs

Javier Marin

LRM

527

189

06 Oct 2024

LLaVA-Video: Video Instruction Tuning With Synthetic Data

495

248

03 Oct 2024

Anchors Aweigh! Sail for Optimal Unified Multi-Modal Representations

451

02 Oct 2024

Designing Interfaces for Multimodal Vector Search Applications

Owen Pendrigh Elliott

Tom Hamer

Jesse Clark

195

18 Sep 2024

One missing piece in Vision and Language: A Survey on Comics Understanding

Emanuele Vivoli

335

14 Sep 2024

EMO-LLaMA: Enhancing Facial Emotion Understanding with Instruction Tuning

Xin Liu

Jingyu Yang

202

21 Aug 2024

VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing

IEEE Transactions on Audio, Speech, and Language Processing (IEEE TASLP), 2024

...

340

11 Aug 2024

VideoQA in the Era of LLMs: An Empirical Study

International Journal of Computer Vision (IJCV), 2024

...

352

08 Aug 2024