Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2402.05889
Cited By
CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion
8 February 2024
Shoubin Yu
Jaehong Yoon
Mohit Bansal
Re-assign community
ArXiv
PDF
HTML
Papers citing
"CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion"
10 / 10 papers shown
Title
Aurelia: Test-time Reasoning Distillation in Audio-Visual LLMs
Sanjoy Chowdhury
Hanan Gani
Nishit Anand
Sayan Nag
Ruohan Gao
Mohamed Elhoseiny
Salman Khan
Dinesh Manocha
LRM
29
0
0
29 Mar 2025
Continually Evolved Multimodal Foundation Models for Cancer Prognosis
Jie Peng
Shuang Zhou
Longwei Yang
Yiran Song
Mohan Zhang
Kaixiong Zhou
Feng Xie
Mingquan Lin
Rui Zhang
Tianlong Chen
74
0
0
30 Jan 2025
OneLLM: One Framework to Align All Modalities with Language
Jiaming Han
Kaixiong Gong
Yiyuan Zhang
Jiaqi Wang
Kaipeng Zhang
D. Lin
Yu Qiao
Peng Gao
Xiangyu Yue
MLLM
96
102
0
10 Jan 2025
Prompting Video-Language Foundation Models with Domain-specific Fine-grained Heuristics for Video Question Answering
Ting Yu
Kunhao Fu
Shuhui Wang
Qingming Huang
Jun Yu
24
0
0
12 Oct 2024
Perception Test: A Diagnostic Benchmark for Multimodal Video Models
Viorica Puatruaucean
Lucas Smaira
Ankush Gupta
Adrià Recasens Continente
L. Markeeva
...
Y. Aytar
Simon Osindero
Dima Damen
Andrew Zisserman
João Carreira
VLM
102
138
0
23 May 2023
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM
MLLM
244
4,186
0
30 Jan 2023
TVLT: Textless Vision-Language Transformer
Zineng Tang
Jaemin Cho
Yixin Nie
Mohit Bansal
VLM
31
28
0
28 Sep 2022
Codified audio language modeling learns useful representations for music information retrieval
Rodrigo Castellon
Chris Donahue
Percy Liang
74
70
0
12 Jul 2021
Pri3D: Can 3D Priors Help 2D Representation Learning?
Ji Hou
Saining Xie
Benjamin Graham
Angela Dai
Matthias Nießner
SSL
3DPC
MDE
75
76
0
22 Apr 2021
Self-supervised Co-training for Video Representation Learning
Tengda Han
Weidi Xie
Andrew Zisserman
SSL
193
371
0
19 Oct 2020
1