2312.07533
Cited By
v4 (latest)
VILA: On Pre-training for Visual Language Models
Computer Vision and Pattern Recognition (CVPR), 2024
12 December 2023
Ji Lin
Hongxu Yin
Ming-Yu Liu
Yao Lu
Pavlo Molchanov
Andrew Tao
Huizi Mao
Jan Kautz
Mohammad Shoeybi
Song Han
MLLM
VLM
ArXiv (abs)
PDF
HTML
HuggingFace (23 upvotes)
Papers citing
"VILA: On Pre-training for Visual Language Models"
50 / 275 papers shown
Title
Measuring Epistemic Humility in Multimodal Large Language Models
Bingkui Tong
Jiaer Xia
Sifeng Shang
Kaiyang Zhou
HILM
112
2
0
11 Sep 2025
MESH -- Understanding Videos Like Human: Measuring Hallucinations in Large Video Models
Garry Yang
Zizhe Chen
Man Hon Wong
Haoyu Lei
Yongqiang Chen
Zhenguo Li
Kaiwen Zhou
James Cheng
131
0
0
10 Sep 2025
BcQLM: Efficient Vision-Language Understanding with Distilled Q-Gated Cross-Modal Fusion
Sike Xiang
Shuang Chen
Amir Atapour-Abarghouei
MLLM
98
0
0
10 Sep 2025
Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search
Xin Lai
Junyi Li
Wei Li
Tao Liu
Tianjian Li
Hengshuang Zhao
LRM
VLM
97
25
0
09 Sep 2025
BTCChat: Advancing Remote Sensing Bi-temporal Change Captioning with Multimodal Large Language Model
Yujie Li
Wenjia Xu
Yuanben Zhang
Zhiwei Wei
Mugen Peng
64
0
0
07 Sep 2025
Med-RewardBench: Benchmarking Reward Models and Judges for Medical Multimodal Large Language Models
Meidan Ding
Jipeng Zhang
Wenxuan Wang
Cheng-Yi Li
Wei-Chieh Fang
Hsin-Yu Wu
Haiqin Zhong
Wenting Chen
LinLin Shen
61
0
0
29 Aug 2025
DriveQA: Passing the Driving Knowledge Test
Maolin Wei
Wanzhou Liu
Eshed Ohn-Bar
ELM
90
1
0
29 Aug 2025
Improving Alignment in LVLMs with Debiased Self-Judgment
Sihan Yang
Chenhang Cui
Zihao Zhao
Yiyang Zhou
Weilong Yan
Ying Wei
Huaxiu Yao
189
0
0
28 Aug 2025
ChainReaction: Causal Chain-Guided Reasoning for Modular and Explainable Causal-Why Video Question Answering
Paritosh Parmar
Eric Peh
Basura Fernando
VGen
LRM
136
0
0
28 Aug 2025
OwlCap: Harmonizing Motion-Detail for Video Captioning via HMD-270K and Caption Set Equivalence Reward
Chunlin Zhong
Qiuxia Hou
Zhangjun Zhou
Shuang Hao
Haonan Lu
Yanhao Zhang
He Tang
Xiang Bai
VGen
115
2
0
26 Aug 2025
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
Weiyun Wang
Zhangwei Gao
Lixin Gu
Hengjun Pu
Long Cui
...
Bowen Zhou
Kai Chen
Yu Qiao
Wenhai Wang
Gen Luo
MLLM
LRM
270
221
0
25 Aug 2025
Language-Specific Layer Matters: Efficient Multilingual Enhancement for Large Vision-Language Models
Yuchun Fan
Yilin Wang
Yongyu Mu
Daigang Xu
Bei Li
Xiaocheng Feng
Tong Xiao
Jingbo Zhu
76
0
0
25 Aug 2025
Hierarchical Contextual Grounding LVLM: Enhancing Fine-Grained Visual-Language Understanding with Robust Grounding
Leilei Guo
Antonio Carlos Rivera
Peiyu Tang
Haoxuan Ren
Zheyu Song
144
1
0
23 Aug 2025
Mitigating Easy Option Bias in Multiple-Choice Question Answering
Hao Zhang
Chen Li
Basura Fernando
AAML
76
0
0
19 Aug 2025
Mitigating Cross-Image Information Leakage in LVLMs for Multi-Image Tasks
Yeji Park
Minyoung Lee
Sanghyuk Chun
Junsuk Choe
68
0
0
19 Aug 2025
SemPT: Semantic Prompt Tuning for Vision-Language Models
Xiao Shi
Yangjun Ou
Zhenzhong Chen
MLLM
VLM
VPVLM
212
0
0
14 Aug 2025
Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory
Lin Long
Yexiao He
Wentao Ye
Yiyuan Pan
Yuan Lin
Hang Li
Junbo Zhao
Wei Li
298
7
0
13 Aug 2025
Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation
Bowen Xue
Zheng-Peng Duan
Qixin Yan
Wenjing Wang
Hao Liu
Chun-Le Guo
Chongyi Li
Chen Li
Jing Lyu
DiffM
VGen
123
4
0
11 Aug 2025
QA-Dragon: Query-Aware Dynamic RAG System for Knowledge-Intensive Visual Question Answering
Zhuohang Jiang
Pangjing Wu
Xu Yuan
Wenqi Fan
Qing Li
36
0
0
07 Aug 2025
mKG-RAG: Multimodal Knowledge Graph-Enhanced RAG for Visual Question Answering
Xu Yuan
Liangbo Ning
Wenqi Fan
Qing Li
146
2
0
07 Aug 2025
IMoRe: Implicit Program-Guided Reasoning for Human Motion Q&A
Chen Li
Chinthani Sugandhika
Yeo Keat Ee
Eric Peh
Hao Zhang
Hong Yang
Deepu Rajan
Basura Fernando
LRM
128
0
0
04 Aug 2025
MagicVL-2B: Empowering Vision-Language Models on Mobile Devices with Lightweight Visual Encoders via Curriculum Learning
Lu Dong
X. Xu
Zeyu Xu
Meng Zhang
Y. Li
...
Jifa Sun
Siling Lin
Shengxun Cheng
L. Zhang
Kang Wang
VLM
88
1
0
03 Aug 2025
Bidirectional Likelihood Estimation with Multi-Modal Large Language Models for Text-Video Retrieval
Dohwan Ko
Ji Soo Lee
M. Choi
Zihang Meng
Hyunwoo J. Kim
288
1
0
31 Jul 2025
MoCHA: Advanced Vision-Language Reasoning with MoE Connector and Hierarchical Group Attention
Yuqi Pang
Bowen Yang
Yun Cao
Fan Rong
Xiaoyu Li
Chen He
VLM
147
0
0
30 Jul 2025
ReGATE: Learning Faster and Better with Fewer Tokens in MLLMs
Chaoyu Li
Yogesh Kulkarni
Pooyan Fazli
135
0
0
29 Jul 2025
Text2VLM: Adapting Text-Only Datasets to Evaluate Alignment Training in Visual Language Models
Gabriel Downer
Sean Craven
Damian Ruck
Jake Thomas
117
1
0
28 Jul 2025
Inducing Causal World Models in LLMs for Zero-Shot Physical Reasoning
Aditya Sharma
Ananya Gupta
Chengyu Wang
Chiamaka Adebayo
LRM
AI4CE
217
1
0
26 Jul 2025
FedVLM: Scalable Personalized Vision-Language Models through Federated Learning
Arkajyoti Mitra
Afia Anjum
Paul Agbaje
Mert D. Pesé
Habeeb Olufowobi
VLM
166
2
0
23 Jul 2025
ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning
Chi-Pin Huang
Yueh-Hua Wu
Min-Hung Chen
Yu-Chun Wang
Fu-En Yang
LM&Ro
LRM
227
38
0
22 Jul 2025
ArtiMuse: Fine-Grained Image Aesthetics Assessment with Joint Scoring and Expert-Level Understanding
Shuo Cao
Nan Ma
Jiayang Li
Xiaohui Li
Lihao Shao
...
Bo Qu
Wenhai Wang
Yu Qiao
Dajuin Yao
Yihao Liu
143
6
0
19 Jul 2025
Scaling Laws for Optimal Data Mixtures
Mustafa Shukor
Louis Béthune
Dan Busbridge
David Grangier
Enrico Fini
Alaaeldin El-Nouby
Pierre Ablin
165
9
0
12 Jul 2025
Advancing Multimodal LLMs by Large-Scale 3D Visual Instruction Dataset Generation
Liu He
Xiao Zeng
Yizhi Song
Albert Y. C. Chen
Lu Xia
Shashwat Verma
Sankalp Dayal
Min Sun
Cheng-Hao Kuo
Daniel G. Aliaga
VGen
214
0
0
11 Jul 2025
Scaling RL to Long Videos
Yukang Chen
Wei Huang
Baifeng Shi
Qinghao Hu
Hanrong Ye
...
Xiaojuan Qi
Sifei Liu
Hongxu Yin
Yao Lu
Song Han
OffRL
AI4TS
VLM
LRM
326
34
0
10 Jul 2025
Omni-Video: Democratizing Unified Video Understanding and Generation
Zhiyu Tan
Hao Yang
Luozheng Qin
Jia Gong
Mengping Yang
Hao Li
VGen
VLM
352
10
0
08 Jul 2025
Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs
Shaojie Zhang
Jiahui Yang
Jianqin Yin
Zhenbo Luo
Jian Luan
288
19
0
27 Jun 2025
Co-VisiON: Co-Visibility ReasONing on Sparse Image Sets of Indoor Scenes
Chao-Yeh Chen
Nobel Dang
Juexiao Zhang
Wenkai Sun
Pengfei Zheng
Xuhang He
Yimeng Ye
Taarun Srinivas
Chen Feng
3DV
305
0
0
20 Jun 2025
GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning
Yi Chen
Yuying Ge
Rui Wang
Yixiao Ge
Junhao Cheng
Mingyu Ding
Xihui Liu
OffRL
VLM
LRM
127
21
0
19 Jun 2025
GenRecal: Generation after Recalibration from Large to Small Vision-Language Models
Byung-Kwan Lee
Ryo Hachiuma
Yong Man Ro
Yu-Chun Wang
Yueh-Hua Wu
VLM
279
2
0
18 Jun 2025
video-SALMONN 2: Caption-Enhanced Audio-Visual Large Language Models
Changli Tang
Yixuan Li
Yudong Yang
Jimin Zhuang
Guangzhi Sun
Wei Li
Zejun Ma
Chao Zhang
291
2
0
18 Jun 2025
Show-o2: Improved Native Unified Multimodal Models
Jinheng Xie
Zhenheng Yang
Mike Zheng Shou
VGen
395
81
0
18 Jun 2025
SmartHome-Bench: A Comprehensive Benchmark for Video Anomaly Detection in Smart Homes Using Multi-Modal Large Language Models
Xinyi Zhao
Congjing Zhang
Pei Guo
Wei Li
Lin Chen
Chaoyue Zhao
Shuai Huang
175
1
0
15 Jun 2025
How Visual Representations Map to Language Feature Space in Multimodal LLMs
Constantin Venhoff
Ashkan Khakzar
Sonia Joseph
Juil Sock
Neel Nanda
222
8
0
13 Jun 2025
VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos
Jiashuo Yu
Y. Wu
Meng Chu
Zhifei Ren
Z. Huang
...
Conghui He
Yu Qiao
Yali Wang
Yi Wang
L. Wang
LRM
391
4
0
12 Jun 2025
DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO
Jinyoung Park
Jeehye Na
Jinyoung Kim
H. Kim
OffRL
304
18
0
09 Jun 2025
Unblocking Fine-Grained Evaluation of Detailed Captions: An Explaining AutoRater and Critic-and-Revise Pipeline
Brian Gordon
Yonatan Bitton
Andreea Marzoca
Yasumasa Onoe
Xiao Wang
Daniel Cohen-Or
Idan Szpektor
CoGe
161
0
0
09 Jun 2025
CoMemo: LVLMs Need Image Context with Image Memory
Shi-Qi Liu
Weijie Su
Xizhou Zhu
Wenhai Wang
Jifeng Dai
VLM
169
0
0
06 Jun 2025
Struct2D: A Perception-Guided Framework for Spatial Reasoning in MLLMs
Fangrui Zhu
Hanhui Wang
Yiming Xie
Jing Gu
Tianye Ding
Jianwei Yang
Huaizu Jiang
3DV
LRM
396
0
0
04 Jun 2025
Mitigating Hallucinations in Large Vision-Language Models via Entity-Centric Multimodal Preference Optimization
Jiulong Wu
Zhengliang Shi
Shuaiqiang Wang
J. Huang
Dawei Yin
Lingyong Yan
Min Cao
Min Zhang
MLLM
265
1
0
04 Jun 2025
Generating 6DoF Object Manipulation Trajectories from Action Description in Egocentric Vision
Computer Vision and Pattern Recognition (CVPR), 2025
Tomoya Yoshida
Shuhei Kurita
Taichi Nishimura
Shinsuke Mori
264
1
0
04 Jun 2025
METok: Multi-Stage Event-based Token Compression for Efficient Long Video Understanding
Mengyue Wang
Shuo Chen
Kristian Kersting
Volker Tresp
Yunpu Ma
VLM
187
1
0
03 Jun 2025