arXiv:2312.07533, v4 (latest)
VILA: On Pre-training for Visual Language Models
Computer Vision and Pattern Recognition (CVPR), 2024
12 December 2023
Ji Lin
Hongxu Yin
Ming-Yu Liu
Yao Lu
Pavlo Molchanov
Andrew Tao
Huizi Mao
Jan Kautz
Mohammad Shoeybi
Song Han
MLLM
VLM
Papers citing "VILA: On Pre-training for Visual Language Models"
50 / 278 papers shown
FALCONEye: Finding Answers and Localizing Content in ONE-hour-long videos with multi-modal LLMs
Carlos Plou
Cesar Borja
Ruben Martinez-Cantin
Ana C. Murillo
290
0
0
25 Mar 2025
LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning?
Kexian Tang
Junyao Gao
Yanhong Zeng
Haodong Duan
Yanan Sun
Zhening Xing
Wenran Liu
Kaifeng Lyu
Kai-xiang Chen
ELM
LRM
403
26
0
25 Mar 2025
Scaling Vision Pre-Training to 4K Resolution
Computer Vision and Pattern Recognition (CVPR), 2025
Baifeng Shi
Boyi Li
Han Cai
Yaojie Lu
Sifei Liu
...
Jan Kautz
Enze Xie
Trevor Darrell
Pavlo Molchanov
Hongxu Yin
CLIP
873
12
0
25 Mar 2025
Audio-centric Video Understanding Benchmark without Text Shortcut
Yue Yang
Jimin Zhuang
Guangzhi Sun
Changli Tang
Yongqian Li
P. Li
Yifan Jiang
W. Li
Tianhao Shen
Chao Zhang
AuLLM
CoGe
386
0
0
25 Mar 2025
CoMP: Continual Multimodal Pre-training for Vision Foundation Models
Yuxiao Chen
L. Meng
Wujian Peng
Zuxuan Wu
Yu-Gang Jiang
VLM
418
3
0
24 Mar 2025
OThink-MR1: Stimulating multimodal generalized reasoning capabilities via dynamic reinforcement learning
Zhiyuan Liu
Yuting Zhang
Feng Liu
Changwang Zhang
Ying Sun
Jun Wang
LRM
441
20
0
20 Mar 2025
Improving LLM Video Understanding with 16 Frames Per Second
Yongqian Li
Changli Tang
Jimin Zhuang
Yudong Yang
Guangzhi Sun
W. Li
Tianhao Shen
Chao Zhang
VLM
365
11
0
18 Mar 2025
DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies
Wei Song
Longji Xu
Zijia Song
Yadong Li
Haoze Sun
Xin Wu
Guosheng Dong
Jianhua Xu
Jiaqi Wang
Kaicheng Yu
361
16
0
18 Mar 2025
Identifying and Mitigating Position Bias of Multi-image Vision-Language Models
Computer Vision and Pattern Recognition (CVPR), 2025
Xinyu Tian
Shu Zou
Zhaoyuan Yang
Jing Zhang
272
10
0
18 Mar 2025
MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs
Erik Daxberger
Nina Wenzel
David Griffiths
Haiming Gang
Justin Lazarow
...
Kai Kang
Marcin Eichner
Yue Yang
Afshin Dehghan
Peter Grasch
278
30
0
17 Mar 2025
VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning
Wenshu Fan
Kevin Qinghong Lin
C. Chen
Mike Zheng Shou
LM&Ro
LRM
869
35
0
17 Mar 2025
TIME: Temporal-Sensitive Multi-Dimensional Instruction Tuning and Robust Benchmarking for Video-LLMs
Yunxiao Wang
Meng Liu
Rui Shao
Haoyu Zhang
Bin Wen
Fan Yang
Yan Li
Di Zhang
Liqiang Nie
229
5
0
13 Mar 2025
Memory-enhanced Retrieval Augmentation for Long Video Understanding
Huaying Yuan
Zhengyang Liang
Minhao Qin
Hongjin Qian
Yan Shu
Zhicheng Dou
Ji-Rong Wen
Andrii Zadaianchuk
VOS
RALM
VLM
294
9
0
12 Mar 2025
Reasoning is All You Need for Video Generalization: A Counterfactual Benchmark with Sub-question Evaluation
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Qiji Zhou
Yifan Gong
Guangsheng Bao
Hongjie Qiu
Jinqiang Li
Xiangrong Zhu
Huajian Zhang
Yue Zhang
LRM
234
3
0
12 Mar 2025
MMRL: Multi-Modal Representation Learning for Vision-Language Models
Computer Vision and Pattern Recognition (CVPR), 2025
Yuncheng Guo
Xiaodong Gu
VLM
OffRL
907
19
0
11 Mar 2025
EMMOE: A Comprehensive Benchmark for Embodied Mobile Manipulation in Open Environments
Dongping Li
Tielong Cai
Tianci Tang
Wenhao Chai
Katherine Rose Driggs-Campbell
Gaoang Wang
LM&Ro
542
2
0
11 Mar 2025
Decoupled Cross-Modal Alignment Network for Text-RGBT Person Retrieval and A High-Quality Benchmark
Yifei Deng
Zhengyu Chen
Ziheng Xu
Chenglong Li
Jin Tang
258
0
0
11 Mar 2025
EgoBlind: Towards Egocentric Visual Assistance for the Blind
Junbin Xiao
Nanxin Huang
Hao Qiu
Zhulin Tao
Xun Yang
Richang Hong
Ming Wang
Angela Yao
EgoV
VLM
409
7
0
11 Mar 2025
Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs
Umberto Cappellazzo
Minsu Kim
Stavros Petridis
347
7
0
09 Mar 2025
StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue through Event-Gated Cognition
Xin Ding
Hao Wu
Yue Yang
Shiqi Jiang
Donglin Bai
Zhibo Chen
Ting Cao
901
9
0
08 Mar 2025
Merge then Realign: Simple and Effective Modality-Incremental Continual Learning for Multimodal LLMs
Dingkun Zhang
Shuhan Qi
Xinyu Xiao
Kehai Chen
Xuan Wang
CLL
MoMe
265
0
0
08 Mar 2025
EgoLife: Towards Egocentric Life Assistant
Computer Vision and Pattern Recognition (CVPR), 2025
Jingkang Yang
Shuai Liu
Hongming Guo
Yuhao Dong
Xinyu Zhang
...
Joerg Widmer
Francesco Gringoli
Lei Yang
Bo Li
Ziwei Liu
EgoV
230
12
0
05 Mar 2025
See What You Are Told: Visual Attention Sink in Large Multimodal Models
International Conference on Learning Representations (ICLR), 2025
Seil Kang
Jinyeong Kim
Junhyeok Kim
Seong Jae Hwang
VLM
310
44
0
05 Mar 2025
Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs
Wei-Yao Wang
Zhao Wang
Helen Suzuki
Yoshiyuki Kobayashi
LRM
310
5
0
04 Mar 2025
MV-MATH: Evaluating Multimodal Math Reasoning in Multi-Visual Contexts
Computer Vision and Pattern Recognition (CVPR), 2025
Peijie Wang
Zhong-Zhi Li
Fei Yin
Xin Yang
Dekang Ran
Cheng-Lin Liu
LRM
550
26
0
28 Feb 2025
Raccoon: Multi-stage Diffusion Training with Coarse-to-Fine Curating Videos
Zhiyu Tan
Junyan Wang
Hao Yang
Luozheng Qin
Hesen Chen
Qiang-feng Zhou
Hao Li
VGen
332
3
0
28 Feb 2025
Nexus: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision
Che Liu
Yingji Zhang
D. Zhang
Weijie Zhang
Chenggong Gong
...
Junwei Liao
Haipang Wu
Ji Liu
André Freitas
Qifan Wang
AuLLM
523
6
0
26 Feb 2025
olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models
Jake Poznanski
Crystal Nam
Jon Borchardt
Jason Dunkelberger
Regan Huff
Daniel Lin
Aman Rangapur
Christopher Wilhelm
Kyle Lo
Luca Soldaini
548
33
0
25 Feb 2025
SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding
International Conference on Learning Representations (ICLR), 2025
Zhenyu Yang
Yihan Hu
Zemin Du
Dizhan Xue
Chuanrui Hu
Jiahong Wu
Fan Yang
Weiming Dong
Changsheng Xu
284
25
0
15 Feb 2025
Vision-Language Models for Edge Networks: A Comprehensive Survey
IEEE Internet of Things Journal (IEEE IoT J.), 2025
Ahmed Sharshar
Latif U. Khan
Waseem Ullah
Mohsen Guizani
VLM
361
3
0
11 Feb 2025
HAMSTER: Hierarchical Action Models For Open-World Robot Manipulation
International Conference on Learning Representations (ICLR), 2025
Yi Li
Yuquan Deng
Jing Zhang
Joel Jang
Marius Memmel
...
Fabio Ramos
Dieter Fox
Anqi Li
Abhishek Gupta
Ankit Goyal
LM&Ro
687
64
0
08 Feb 2025
AutoGUI: Scaling GUI Grounding with Automatic Functionality Annotations from LLMs
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Hongxin Li
Jingfan Chen
Jingran Su
Yuntao Chen
Qing Li
Rundong Wang
957
8
0
04 Feb 2025
Visual Attention Never Fades: Selective Progressive Attention ReCalibration for Detailed Image Captioning in Multimodal Large Language Models
Mingi Jung
Saehyung Lee
Eunji Kim
Sungroh Yoon
956
6
0
03 Feb 2025
Improving Video Generation with Human Feedback
Jie Liu
Gongye Liu
Jiajun Liang
Ziyang Yuan
Xiaokun Liu
...
Fei Yang
Pengfei Wan
Di Zhang
Kun Gai
Yujiu Yang
VGen
EGVM
426
97
0
23 Jan 2025
CBVLM: Training-free Explainable Concept-based Large Vision Language Models for Medical Image Classification
Computers in Biology and Medicine (Comput. Biol. Med.), 2025
Cristiano Patrício
Isabel Rio-Torto
J. S. Cardoso
Luís F. Teixeira
João C. Neves
VLM
1.0K
3
0
21 Jan 2025
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks
Computer Vision and Pattern Recognition (CVPR), 2025
Miran Heo
Min-Hung Chen
De-An Huang
Sifei Liu
Subhashree Radhakrishnan
Seon Joo Kim
Yu-Chun Wang
Ryo Hachiuma
ObjD
VLM
504
8
0
14 Jan 2025
LEO: Boosting Mixture of Vision Encoders for Multimodal Large Language Models
Mozhgan Nasr Azadani
James Riddell
Sean Sedwards
Krzysztof Czarnecki
MLLM
VLM
211
7
0
13 Jan 2025
TB-Bench: Training and Testing Multi-Modal AI for Understanding Spatio-Temporal Traffic Behaviors from Dashcam Images/Videos
Korawat Charoenpitaks
Van-Quang Nguyen
Masanori Suganuma
Kentaro Arai
Seiji Totsuka
Hiroshi Ino
Takayuki Okatani
VLM
112
2
0
10 Jan 2025
RAG-Check: Evaluating Multimodal Retrieval Augmented Generation Performance
Matin Mortaheb
M. A. Khojastepour
S. Chakradhar
S. Ulukus
VLM
RALM
177
7
0
08 Jan 2025
Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation
Computer Vision and Pattern Recognition (CVPR), 2025
Yuhui Zhang
Yuchang Su
Yiming Liu
Xiaohan Wang
James Burgess
...
Josiah Aklilu
Alejandro Lozano
Anjiang Wei
Ludwig Schmidt
Serena Yeung-Levy
371
18
0
06 Jan 2025
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
Computer Vision and Pattern Recognition (CVPR), 2025
Yuqian Yuan
Hang Zhang
Wentong Li
Zesen Cheng
Boqiang Zhang
...
Deli Zhao
Wenqiao Zhang
Yueting Zhuang
Jianke Zhu
Lidong Bing
364
36
0
31 Dec 2024
HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding
Computer Vision and Pattern Recognition (CVPR), 2025
Chenxin Tao
Shiqian Su
X. Zhu
Chenyu Zhang
Zhe Chen
...
Wenhai Wang
Lewei Lu
Gao Huang
Yu Qiao
Jifeng Dai
MLLM
VLM
455
5
0
20 Dec 2024
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
Computer Vision and Pattern Recognition (CVPR), 2025
Jihan Yang
Shusheng Yang
Anjali W. Gupta
Rilyn Han
Li Fei-Fei
Saining Xie
LRM
458
324
0
18 Dec 2024
AgentPS: Agentic Process Supervision for Content Moderation with Multimodal LLMs
Gorden Liu
Yu Sun
R.-H. Sun
Xin Dong
Hongyu Xiong
LLMAG
198
1
0
15 Dec 2024
Black Swan: Abductive and Defeasible Video Reasoning in Unpredictable Events
Computer Vision and Pattern Recognition (CVPR), 2025
Aditya Chinchure
Sahithya Ravi
R. Ng
Vered Shwartz
Boyang Albert Li
Leonid Sigal
ReLM
LRM
VLM
377
7
0
07 Dec 2024
EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios
Lu Qiu
Yuying Ge
Yi Chen
Yixiao Ge
Mingyu Ding
Xihui Liu
LLMAG
LRM
348
18
0
05 Dec 2024
AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning
Yiwu Zhong
Zhuoming Liu
Yin Li
Liwei Wang
394
19
0
04 Dec 2024
Progress-Aware Video Frame Captioning
Computer Vision and Pattern Recognition (CVPR), 2025
Zihui Xue
Joungbin An
Xitong Yang
Kristen Grauman
540
6
0
03 Dec 2024
VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models
Computer Vision and Pattern Recognition (CVPR), 2025
Byung-Kwan Lee
Ryo Hachiuma
Yu-Chiang Frank Wang
Y. Ro
Yueh-Hua Wu
VLM
365
5
0
02 Dec 2024
On Domain-Adaptive Post-Training for Multimodal Large Language Models
Daixuan Cheng
Shaohan Huang
Ziyu Zhu
Xintong Zhang
Wayne Xin Zhao
Zhongzhi Luan
Bo Dai
Zhenliang Zhang
VLM
415
5
0
29 Nov 2024