VILA: On Pre-training for Visual Language Models

Computer Vision and Pattern Recognition (CVPR), 2023
12 December 2023
Ji Lin
Hongxu Yin
Ming-Yu Liu
Yao Lu
Pavlo Molchanov
Andrew Tao
Huizi Mao
Jan Kautz
Mohammad Shoeybi
Song Han
MLLM, VLM

Papers citing "VILA: On Pre-training for Visual Language Models"

50 / 273 papers shown
SFA: Scan, Focus, and Amplify toward Guidance-aware Answering for Video TextVQA
Haibin He
Qihuang Zhong
Juhua Liu
Bo Du
Peng Wang
Jing Zhang
68
0
0
25 Nov 2025
Growing with the Generator: Self-paced GRPO for Video Generation
Rui Li
Yuanzhi Liang
Ziqi Ni
H. Huang
Chi Zhang
Xuelong Li
EGVM, VGen
84
0
0
24 Nov 2025
LAST: LeArning to Think in Space and Time for Generalist Vision-Language Models
Shuai Wang
D. Zhang
Tianyi Bai
Shitong Shao
Jiebo Luo
Jiaheng Wei
VLM
108
0
0
24 Nov 2025
Discover, Learn, and Reinforce: Scaling Vision-Language-Action Pretraining with Diverse RL-Generated Trajectories
Rushuai Yang
Zhiyuan Feng
Tianxiang Zhang
Kaixin Wang
Chuheng Zhang
Li Zhao
Xiu Su
Yi-Ling Chen
Jiang Bian
OffRL
173
0
0
24 Nov 2025
SineProject: Machine Unlearning for Stable Vision Language Alignment
Arpit Garg
Hemanth Saratchandran
Simon Lucey
MU
165
0
0
23 Nov 2025
Multimodal LLMs Do Not Compose Skills Optimally Across Modalities
Paula Ontalvilla
Aitor Ormazabal
Gorka Azkune
93
0
0
11 Nov 2025
SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards
Hunar Batra
Haoqin Tu
Hardy Chen
Yuanze Lin
Cihang Xie
Ronald Clark
OffRL, ReLM, LRM
283
0
0
10 Nov 2025
LiveStar: Live Streaming Assistant for Real-World Online Video Understanding
Zhenyu Yang
Kairui Zhang
Yuhang Hu
Bing Wang
Shengsheng Qian
Bin Wen
Fan Yang
Tingting Gao
Weiming Dong
Changsheng Xu
OffRL, AI4TS, VLM
200
0
0
07 Nov 2025
What do vision-language models see in the context? Investigating multimodal in-context learning
G. O. D. Santos
Esther Colombini
Sandra Avila
68
0
0
28 Oct 2025
PixelRefer: A Unified Framework for Spatio-Temporal Object Referring with Arbitrary Granularity
Yuqian Yuan
W. Zhang
Xin Li
Shihao Wang
Kehan Li
Wentong Li
Jun Xiao
Lei Zhang
Beng Chin Ooi
ObjD
294
0
0
27 Oct 2025
STATUS Bench: A Rigorous Benchmark for Evaluating Object State Understanding in Vision-Language Models
Mahiro Ukai
Shuhei Kurita
Nakamasa Inoue
CoGe
189
0
0
26 Oct 2025
VAR: Visual Attention Reasoning via Structured Search and Backtracking
Wei Cai
Jian Zhao
Yuchen Yuan
T. Zhang
Ming Zhu
Haichuan Tang
Chi Zhang
Xuelong Li
OffRL, LRM
92
0
0
21 Oct 2025
Exploring a Unified Vision-Centric Contrastive Alternatives on Multi-Modal Web Documents
Yiqi Lin
Alex Jinpeng Wang
Linjie Li
Zhengyuan Yang
Mike Zheng Shou
100
0
0
21 Oct 2025
Input Domain Aware MoE: Decoupling Routing Decisions from Task Optimization in Mixture of Experts
Yongxiang Hua
H. Cao
Zhou Tao
Bocheng Li
Zihao Wu
Chaohu Liu
Linli Xu
MoE
148
0
0
18 Oct 2025
OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM
Hanrong Ye
Chao-Han Huck Yang
Arushi Goel
Wei Huang
Ligeng Zhu
...
Andrew Tao
Song Han
Jan Kautz
Hongxu Yin
Pavlo Molchanov
142
3
0
17 Oct 2025
Train a Unified Multimodal Data Quality Classifier with Synthetic Data
Weizhi Wang
Rongmei Lin
Shiyang Li
Colin Lockard
Ritesh Sarkhel
Sanket Lokegaonkar
Jingbo Shang
Xifeng Yan
Nasser Zalmout
Xian Li
80
0
0
16 Oct 2025
Efficient Video Sampling: Pruning Temporally Redundant Tokens for Faster VLM Inference
Natan Bagrov
Eugene Khvedchenia
Borys Tymchenko
Shay Aharon
Lior Kadoch
...
Yonatan Geifman
Ran Zilberstein
Tuomas Rintamaki
Matthieu Le
Andrew Tao
VLM
100
1
0
16 Oct 2025
UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning
Tiancheng Gu
Kaicheng Yang
Kaichen Zhang
Xiang An
Ziyong Feng
Y. Zhang
Weidong Cai
Jiankang Deng
Lidong Bing
165
4
0
15 Oct 2025
DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search
Kartik Narayan
Yang Xu
Tian Cao
Kavya Nerella
Vishal M. Patel
Navid Shiee
Peter Grasch
Chao Jia
Yinfei Yang
Zhe Gan
ObjD, KELM, VLM
216
3
0
14 Oct 2025
CompoDistill: Attention Distillation for Compositional Reasoning in Multimodal LLMs
Jiwan Kim
Kibum Kim
Sangwoo Seo
Chanyoung Park
VLM
120
0
0
14 Oct 2025
video-SALMONN S: Streaming Audio-Visual LLMs Beyond Length Limits via Memory
Guangzhi Sun
Yixuan Li
Xiaodong Wu
Yudong Yang
Wei Li
Zejun Ma
Chao Zhang
64
1
0
13 Oct 2025
UniFlow: A Unified Pixel Flow Tokenizer for Visual Understanding and Generation
Zhengrong Yue
H. Zhang
Xiangyu Zeng
Boyu Chen
Chenting Wang
...
Lu Dong
Kunpeng Du
Yi Wang
Limin Wang
Yali Wang
148
3
0
12 Oct 2025
Don't Run with Scissors: Pruning Breaks VLA Models but They Can Be Recovered
Jason J. Jabbour
Dong-Ki Kim
Max Smith
Jay Patrikar
Radhika Ghosal
Youhui Wang
Ali Agha
Vijay Janapa Reddi
Shayegan Omidshafiei
VLM
108
1
0
09 Oct 2025
Vision-Language-Action Models for Robotics: A Review Towards Real-World Applications
IEEE Access (IEEE Access), 2025
Kento Kawaharazuka
Jihoon Oh
Jun Yamada
Ingmar Posner
Yuke Zhu
LM&Ro
203
19
0
08 Oct 2025
Automated Repeatable Adversary Threat Emulation with Effects Language (EL)
Suresh Damodaran
Paul D. Rowe
AAML
84
0
0
07 Oct 2025
A.I.R.: Enabling Adaptive, Iterative, and Reasoning-based Frame Selection For Video Question Answering
Yuanhao Zou
Shengji Jin
Andong Deng
Youpeng Zhao
Jun Wang
Chen Chen
64
0
0
06 Oct 2025
FrameOracle: Learning What to See and How Much to See in Videos
Chaoyu Li
Tianzhi Li
Fei Tao
Zhenyu Zhao
Ziqian Wu
Maozheng Zhao
Juntong Song
Cheng Niu
Pooyan Fazli
VLM
76
0
0
04 Oct 2025
Embracing Evolution: A Call for Body-Control Co-Design in Embodied Humanoid Robot
Guiliang Liu
Bo Yue
Yi Jin Kim
Kui Jia
108
1
0
03 Oct 2025
VELA: An LLM-Hybrid-as-a-Judge Approach for Evaluating Long Image Captions
Kazuki Matsuda
Yuiga Wada
Shinnosuke Hirano
Seitaro Otsuki
Komei Sugiura
VLM
116
0
0
30 Sep 2025
LMOD+: A Comprehensive Multimodal Dataset and Benchmark for Developing and Evaluating Multimodal Large Language Models in Ophthalmology
Zhenyue Qin
Yang Liu
Yu Yin
Jinyu Ding
H. Zhang
...
Zhiyong Lu
Yih-Chung Tham
Ninghao Liu
Xiuzhen Zhang
Qingyu Chen
68
0
0
30 Sep 2025
Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training
Junlin Han
Shengbang Tong
David Fan
Yufan Ren
Koustuv Sinha
Juil Sock
Filippos Kokkinos
LRM, VLM
139
4
0
30 Sep 2025
NeMo: Needle in a Montage for Video-Language Understanding
Zi-Yuan Hu
Shuo Liang
Duo Zheng
Yanyang Li
Yeyao Tao
...
Jianguang Yu
Jing-ling Huang
Meng Fang
Yin Li
Liwei Wang
113
1
0
29 Sep 2025
Vid-LLM: A Compact Video-based 3D Multimodal LLM with Reconstruction-Reasoning Synergy
Haijier Chen
Bo Xu
Shoujian Zhang
Haoze Liu
Jiaxuan Lin
Jingrong Wang
LRM
106
1
0
29 Sep 2025
Euclid's Gift: Enhancing Spatial Perception and Reasoning in Vision-Language Models via Geometric Surrogate Tasks
Shijie Lian
Changti Wu
L. Yang
Hang Yuan
Bin Yu
Lei Zhang
Kai Chen
LRM
159
1
0
29 Sep 2025
Vision-Language Alignment from Compressed Image Representations using 2D Gaussian Splatting
Yasmine Omri
Connor Ding
Tsachy Weissman
Thierry Tambe
3DGS, VLM
110
0
0
26 Sep 2025
Estimating the Empowerment of Language Model Agents
Jinyeop Song
Jeff Gore
Max Kleiman-Weiner
106
1
0
26 Sep 2025
InfiMed-Foundation: Pioneering Advanced Multimodal Medical Models with Compute-Efficient Pre-Training and Multi-Stage Fine-Tuning
Guanghao Zhu
Zhitian Hou
Zeyu Liu
Zhijie Sang
C. Xie
Hongxia Yang
LM&MA, MedIm
145
0
0
26 Sep 2025
Meta-Memory: Retrieving and Integrating Semantic-Spatial Memories for Robot Spatial Reasoning
Yufan Mao
Hanjing Ye
Wenlong Dong
Chengjie Zhang
Hong Zhang
LM&Ro
44
0
0
25 Sep 2025
OmniVLA: An Omni-Modal Vision-Language-Action Model for Robot Navigation
Noriaki Hirose
Catherine Glossop
Dhruv Shah
Sergey Levine
LM&Ro
160
2
0
23 Sep 2025
VIR-Bench: Evaluating Geospatial and Temporal Understanding of MLLMs via Travel Video Itinerary Reconstruction
Hao Wang
Eiki Murata
Lingfang Zhang
Ayako Sato
So Fukuda
...
Sebastian Zwirner
Yi-Chia Chen
Hiroyuki Otomo
Hiroki Ouchi
Daisuke Kawahara
82
0
0
23 Sep 2025
MAPO: Mixed Advantage Policy Optimization
Wenke Huang
Quan Zhang
Yiyang Fang
Jian Liang
Xuankun Rong
...
Mingjun Li
Leszek Rutkowski
Mang Ye
Bo Du
Dacheng Tao
155
4
0
23 Sep 2025
PEEK: Guiding and Minimal Image Representations for Zero-Shot Generalization of Robot Manipulation Policies
Jesse Zhang
Marius Memmel
Kevin Kim
Dieter Fox
Jesse Thomason
Fabio Ramos
Erdem Bıyık
Abhishek Gupta
Anqi Li
LM&Ro
81
1
0
22 Sep 2025
MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer
Yanghao Li
Rui Qian
Bowen Pan
Haotian Zhang
Haoshuo Huang
...
Zhengdong Zhang
Chen Chen
Yang Zhao
Ruoming Pang
Zhifeng Chen
MLLM
184
4
0
19 Sep 2025
Embodied Arena: A Comprehensive, Unified, and Evolving Evaluation Platform for Embodied AI
Fei Ni
Min Zhang
Pengyi Li
Yifu Yuan
Lingfeng Zhang
...
Yuzheng Zhuang
Yingxue Zhang
Yan Zheng
Hongyao Tang
Jianye Hao
ELM
134
1
0
18 Sep 2025
3D Aware Region Prompted Vision Language Model
A. Cheng
Yang Fu
Yukang Chen
Zhijian Liu
X. Li
...
Jan Kautz
Pavlo Molchanov
Hongxu Yin
Xiaolong Wang
Sifei Liu
103
6
0
16 Sep 2025
When Safe Unimodal Inputs Collide: Optimizing Reasoning Chains for Cross-Modal Safety in Multimodal Large Language Models
Wei Cai
Shujuan Liu
Jian Zhao
Ziyan Shi
Yusheng Zhao
Yuchen Yuan
Tianle Zhang
Chi Zhang
Xuelong Li
LRM
167
3
0
15 Sep 2025
CoachMe: Decoding Sport Elements with a Reference-Based Coaching Instruction Generation Model
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Wei-Hsin Yeh
Yu-An Su
Chih-Ning Chen
Yi-Hsueh Lin
Calvin Ku
Wen-Hsin Chiu
Min-Chun Hu
Lun-Wei Ku
84
0
0
15 Sep 2025
Traffic-MLLM: A Spatio-Temporal MLLM with Retrieval-Augmented Generation for Causal Inference in Traffic
Waikit Xiu
Qiang Lu
Xiying Li
Chen Hu
Shengbo Sun
LRM
56
0
0
14 Sep 2025
Enhancing Generalization in Vision-Language-Action Models by Preserving Pretrained Representations
Shresth Grover
Akshay Gopalkrishnan
Bo Ai
Henrik I. Christensen
H. Su
Xuanlin Li
VLM
173
4
0
14 Sep 2025
Measuring Epistemic Humility in Multimodal Large Language Models
Bingkui Tong
Jiaer Xia
Sifeng Shang
Kaiyang Zhou
HILM
100
2
0
11 Sep 2025