2312.07533
VILA: On Pre-training for Visual Language Models
Computer Vision and Pattern Recognition (CVPR), 2023
12 December 2023
Ji Lin
Hongxu Yin
Ming-Yu Liu
Yao Lu
Pavlo Molchanov
Andrew Tao
Huizi Mao
Jan Kautz
Mohammad Shoeybi
Song Han
MLLM
VLM
HuggingFace (23 upvotes)
Papers citing "VILA: On Pre-training for Visual Language Models"
50 / 280 papers shown
CartoMapQA: A Fundamental Benchmark Dataset Evaluating Vision-Language Models on Cartographic Map Understanding
H. Ung
Guillaume Habault
Yasutaka Nishimura
Hao Niu
Roberto Legaspi
...
Ryoichi Kojima
Masato Taya
Chihiro Ono
A. Minamikawa
Y. Liu
104
0
0
03 Dec 2025
SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL
Siyi Chen
Mikaela Angelina Uy
Chan Hee Song
Faisal Ladhak
Adithyavairavan Murali
Qing Qu
Stan Birchfield
Valts Blukis
Jonathan Tremblay
OffRL
LRM
138
0
0
03 Dec 2025
Describe Anything Anywhere At Any Moment
Nicolas Gorlo
Lukas Schmid
Luca Carlone
3DV
VLM
350
0
0
29 Nov 2025
SFA: Scan, Focus, and Amplify toward Guidance-aware Answering for Video TextVQA
Haibin He
Qihuang Zhong
Juhua Liu
Bo Du
Peng Wang
Jing Zhang
108
0
0
25 Nov 2025
Vision-Language Memory for Spatial Reasoning
Zuntao Liu
Yi Du
Taimeng Fu
Shaoshu Su
Cherie Ho
Chen Wang
VLM
LRM
249
0
0
25 Nov 2025
Growing with the Generator: Self-paced GRPO for Video Generation
Rui Li
Yuanzhi Liang
Ziqi Ni
H. Huang
Chi Zhang
Xuelong Li
EGVM
VGen
120
0
0
24 Nov 2025
LAST: LeArning to Think in Space and Time for Generalist Vision-Language Models
Shuai Wang
D. Zhang
Tianyi Bai
Shitong Shao
Jiebo Luo
Jiaheng Wei
VLM
138
1
0
24 Nov 2025
Discover, Learn, and Reinforce: Scaling Vision-Language-Action Pretraining with Diverse RL-Generated Trajectories
Rushuai Yang
Zhiyuan Feng
Tianxiang Zhang
Kaixin Wang
Chuheng Zhang
Li Zhao
Xiu Su
Yi-Ling Chen
Jiang Bian
OffRL
205
0
0
24 Nov 2025
SineProject: Machine Unlearning for Stable Vision Language Alignment
Arpit Garg
Hemanth Saratchandran
Simon Lucey
MU
221
0
0
23 Nov 2025
Insight-A: Attribution-aware for Multimodal Misinformation Detection
Junjie Wu
Yumeng Fu
Chen Gong
Guohong Fu
40
0
0
17 Nov 2025
Multimodal LLMs Do Not Compose Skills Optimally Across Modalities
Paula Ontalvilla
Aitor Ormazabal
Gorka Azkune
129
0
0
11 Nov 2025
SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards
Hunar Batra
Haoqin Tu
Hardy Chen
Yuanze Lin
Cihang Xie
Ronald Clark
OffRL
ReLM
LRM
359
0
0
10 Nov 2025
LiveStar: Live Streaming Assistant for Real-World Online Video Understanding
Zhenyu Yang
Kairui Zhang
Yuhang Hu
Bing Wang
Shengsheng Qian
Bin Wen
Fan Yang
Tingting Gao
Weiming Dong
Changsheng Xu
OffRL
AI4TS
VLM
260
0
0
07 Nov 2025
What do vision-language models see in the context? Investigating multimodal in-context learning
G. O. D. Santos
Esther Colombini
Sandra Avila
102
0
0
28 Oct 2025
PixelRefer: A Unified Framework for Spatio-Temporal Object Referring with Arbitrary Granularity
Yuqian Yuan
W. Zhang
Xin Li
Shihao Wang
Kehan Li
Wentong Li
Jun Xiao
Lei Zhang
Beng Chin Ooi
ObjD
362
0
0
27 Oct 2025
STATUS Bench: A Rigorous Benchmark for Evaluating Object State Understanding in Vision-Language Models
Mahiro Ukai
Shuhei Kurita
Nakamasa Inoue
CoGe
233
0
0
26 Oct 2025
Visual Attention Reasoning via Hierarchical Search and Self-Verification
Wei Cai
Jian Zhao
Yuchen Yuan
T. Zhang
Ming Zhu
Haichuan Tang
Chi Zhang
OffRL
LRM
160
0
0
21 Oct 2025
Exploring a Unified Vision-Centric Contrastive Alternatives on Multi-Modal Web Documents
Yiqi Lin
Alex Jinpeng Wang
Linjie Li
Zhengyuan Yang
Mike Zheng Shou
132
1
0
21 Oct 2025
Input Domain Aware MoE: Decoupling Routing Decisions from Task Optimization in Mixture of Experts
Yongxiang Hua
H. Cao
Zhou Tao
Bocheng Li
Zihao Wu
Chaohu Liu
Linli Xu
MoE
212
0
0
18 Oct 2025
OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM
Hanrong Ye
Chao-Han Huck Yang
Arushi Goel
Wei Huang
Ligeng Zhu
...
Andrew Tao
Song Han
Jan Kautz
Hongxu Yin
Pavlo Molchanov
174
3
0
17 Oct 2025
Efficient Video Sampling: Pruning Temporally Redundant Tokens for Faster VLM Inference
Natan Bagrov
Eugene Khvedchenia
Borys Tymchenko
Shay Aharon
Lior Kadoch
...
Yonatan Geifman
Ran Zilberstein
Tuomas Rintamaki
Matthieu Le
Andrew Tao
VLM
128
1
0
16 Oct 2025
Train a Unified Multimodal Data Quality Classifier with Synthetic Data
Weizhi Wang
Rongmei Lin
Shiyang Li
Colin Lockard
Ritesh Sarkhel
Sanket Lokegaonkar
Jingbo Shang
Xifeng Yan
Nasser Zalmout
Xian Li
92
0
0
16 Oct 2025
UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning
Tiancheng Gu
Kaicheng Yang
Kaichen Zhang
Xiang An
Ziyong Feng
Y. Zhang
Weidong Cai
Jiankang Deng
Lidong Bing
211
5
0
15 Oct 2025
CompoDistill: Attention Distillation for Compositional Reasoning in Multimodal LLMs
Jiwan Kim
Kibum Kim
Sangwoo Seo
Chanyoung Park
VLM
144
1
0
14 Oct 2025
DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search
Kartik Narayan
Yang Xu
Tian Cao
Kavya Nerella
Vishal M. Patel
Navid Shiee
Peter Grasch
Chao Jia
Yinfei Yang
Zhe Gan
ObjD
KELM
VLM
256
4
0
14 Oct 2025
video-SALMONN S: Streaming Audio-Visual LLMs Beyond Length Limits via Memory
Guangzhi Sun
Yixuan Li
Xiaodong Wu
Yudong Yang
Wei Li
Zejun Ma
Chao Zhang
84
1
0
13 Oct 2025
UniFlow: A Unified Pixel Flow Tokenizer for Visual Understanding and Generation
Zhengrong Yue
H. Zhang
Xiangyu Zeng
Boyu Chen
Chenting Wang
...
Lu Dong
Kunpeng Du
Yi Wang
Limin Wang
Yali Wang
180
7
0
12 Oct 2025
Don't Run with Scissors: Pruning Breaks VLA Models but They Can Be Recovered
Jason J. Jabbour
Dong-Ki Kim
Max Smith
Jay Patrikar
Radhika Ghosal
Youhui Wang
Ali Agha
Vijay Janapa Reddi
Shayegan Omidshafiei
VLM
132
1
0
09 Oct 2025
Vision-Language-Action Models for Robotics: A Review Towards Real-World Applications
IEEE Access (IEEE Access), 2025
Kento Kawaharazuka
Jihoon Oh
Jun Yamada
Ingmar Posner
Yuke Zhu
LM&Ro
259
24
0
08 Oct 2025
Automated Repeatable Adversary Threat Emulation with Effects Language (EL)
Suresh Damodaran
Paul D. Rowe
AAML
132
9
0
07 Oct 2025
A.I.R.: Enabling Adaptive, Iterative, and Reasoning-based Frame Selection For Video Question Answering
Yuanhao Zou
Shengji Jin
Andong Deng
Youpeng Zhao
Jun Wang
Chen Chen
104
0
0
06 Oct 2025
FrameOracle: Learning What to See and How Much to See in Videos
Chaoyu Li
Tianzhi Li
Fei Tao
Zhenyu Zhao
Ziqian Wu
Maozheng Zhao
Juntong Song
Cheng Niu
Pooyan Fazli
VLM
120
0
0
04 Oct 2025
Embracing Evolution: A Call for Body-Control Co-Design in Embodied Humanoid Robot
Guiliang Liu
Bo Yue
Yi Jin Kim
Kui Jia
136
1
0
03 Oct 2025
LMOD+: A Comprehensive Multimodal Dataset and Benchmark for Developing and Evaluating Multimodal Large Language Models in Ophthalmology
Zhenyue Qin
Yang Liu
Yu Yin
Jinyu Ding
H. Zhang
...
Zhiyong Lu
Yih-Chung Tham
Ninghao Liu
Xiuzhen Zhang
Qingyu Chen
84
0
0
30 Sep 2025
VELA: An LLM-Hybrid-as-a-Judge Approach for Evaluating Long Image Captions
Kazuki Matsuda
Yuiga Wada
Shinnosuke Hirano
Seitaro Otsuki
Komei Sugiura
VLM
152
1
0
30 Sep 2025
Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training
Junlin Han
Shengbang Tong
David Fan
Yufan Ren
Koustuv Sinha
Juil Sock
Filippos Kokkinos
LRM
VLM
191
6
0
30 Sep 2025
NeMo: Needle in a Montage for Video-Language Understanding
Zi-Yuan Hu
Shuo Liang
Duo Zheng
Yanyang Li
Yeyao Tao
...
Jianguang Yu
Jing-ling Huang
Meng Fang
Yin Li
Liwei Wang
161
2
0
29 Sep 2025
Vid-LLM: A Compact Video-based 3D Multimodal LLM with Reconstruction-Reasoning Synergy
Haijier Chen
Bo Xu
Shoujian Zhang
Haoze Liu
Jiaxuan Lin
Jingrong Wang
LRM
146
1
0
29 Sep 2025
Euclid's Gift: Enhancing Spatial Perception and Reasoning in Vision-Language Models via Geometric Surrogate Tasks
Shijie Lian
Changti Wu
L. Yang
Hang Yuan
Bin Yu
Lei Zhang
Kai Chen
LRM
231
1
0
29 Sep 2025
Estimating the Empowerment of Language Model Agents
Jinyeop Song
Jeff Gore
Max Kleiman-Weiner
134
1
0
26 Sep 2025
InfiMed-Foundation: Pioneering Advanced Multimodal Medical Models with Compute-Efficient Pre-Training and Multi-Stage Fine-Tuning
Guanghao Zhu
Zhitian Hou
Zeyu Liu
Zhijie Sang
C. Xie
Hongxia Yang
LM&MA
MedIm
185
0
0
26 Sep 2025
GaussianVision: Vision-Language Alignment from Compressed Image Representations using 2D Gaussian Splatting
Yasmine Omri
Connor Ding
Tsachy Weissman
Thierry Tambe
3DGS
VLM
282
0
0
26 Sep 2025
Meta-Memory: Retrieving and Integrating Semantic-Spatial Memories for Robot Spatial Reasoning
Yufan Mao
Hanjing Ye
Wenlong Dong
Chengjie Zhang
Hong Zhang
LM&Ro
120
0
0
25 Sep 2025
MAPO: Mixed Advantage Policy Optimization
Wenke Huang
Quan Zhang
Yiyang Fang
Jian Liang
Xuankun Rong
...
Mingjun Li
Leszek Rutkowski
Mang Ye
Bo Du
Dacheng Tao
235
4
0
23 Sep 2025
VIR-Bench: Evaluating Geospatial and Temporal Understanding of MLLMs via Travel Video Itinerary Reconstruction
Hao Wang
Eiki Murata
Lingfang Zhang
Ayako Sato
So Fukuda
...
Sebastian Zwirner
Yi-Chia Chen
Hiroyuki Otomo
Hiroki Ouchi
Daisuke Kawahara
134
0
0
23 Sep 2025
OmniVLA: An Omni-Modal Vision-Language-Action Model for Robot Navigation
Noriaki Hirose
Catherine Glossop
Dhruv Shah
Sergey Levine
LM&Ro
188
3
0
23 Sep 2025
PEEK: Guiding and Minimal Image Representations for Zero-Shot Generalization of Robot Manipulation Policies
Jesse Zhang
Marius Memmel
Kevin Kim
Dieter Fox
Jesse Thomason
Fabio Ramos
Erdem Bıyık
Abhishek Gupta
Anqi Li
LM&Ro
125
1
0
22 Sep 2025
MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer
Yanghao Li
Rui Qian
Bowen Pan
Haotian Zhang
Haoshuo Huang
...
Zhengdong Zhang
Chen Chen
Yang Zhao
Ruoming Pang
Zhifeng Chen
MLLM
204
4
0
19 Sep 2025
Embodied Arena: A Comprehensive, Unified, and Evolving Evaluation Platform for Embodied AI
Fei Ni
Min Zhang
Pengyi Li
Yifu Yuan
Lingfeng Zhang
...
Yuzheng Zhuang
Yingxue Zhang
Yan Zheng
Hongyao Tang
Jianye Hao
ELM
194
1
0
18 Sep 2025
3D Aware Region Prompted Vision Language Model
A. Cheng
Yang Fu
Yukang Chen
Zhijian Liu
X. Li
...
Jan Kautz
Pavlo Molchanov
Hongxu Yin
Xiaolong Wang
Sifei Liu
139
8
0
16 Sep 2025