Communities
Connect sessions
AI calendar
Organizations
Join Slack
Contact Sales
Search
Open menu
Home
Papers
2312.07533
Cited By
v1
v2
v3
v4 (latest)
VILA: On Pre-training for Visual Language Models
Computer Vision and Pattern Recognition (CVPR), 2023
12 December 2023
Ji Lin
Hongxu Yin
Ming-Yu Liu
Yao Lu
Pavlo Molchanov
Andrew Tao
Huizi Mao
Jan Kautz
Mohammad Shoeybi
Song Han
MLLM
VLM
Re-assign community
ArXiv (abs)
PDF
HTML
HuggingFace (23 upvotes)
Papers citing
"VILA: On Pre-training for Visual Language Models"
50 / 275 papers shown
Title
RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics
Computer Vision and Pattern Recognition (CVPR), 2024
Chan Hee Song
Valts Blukis
Jonathan Tremblay
Stephen Tyree
Yu-Chuan Su
Stan Birchfield
748
76
0
25 Nov 2024
Human-Activity AGV Quality Assessment: A Benchmark Dataset and an Objective Evaluation Metric
Zhichao Zhang
Wei Sun
Xinyue Li
Yunhao Li
Qihang Ge
...
Zhongpeng Ji
Fengyu Sun
Shangling Jui
Xiongkuo Min
Guoquan Zheng
EGVM
461
10
0
25 Nov 2024
AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea
Computer Vision and Pattern Recognition (CVPR), 2024
Qifan Yu
Wei Chow
Zhongqi Yue
Kaihang Pan
Yang Wu
Xiaoyang Wan
Juncheng Billy Li
Siliang Tang
Hao Zhang
Yueting Zhuang
DiffM
429
106
0
24 Nov 2024
Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension
Yongdong Luo
Xiawu Zheng
Guilin Li
Guilin Li
Haojia Lin
...
Jinfa Huang
Jiayi Ji
Jiebo Luo
Rongrong Ji
Rongrong Ji
VLM
542
67
0
20 Nov 2024
VILA-M3: Enhancing Vision-Language Models with Medical Expert Knowledge
Computer Vision and Pattern Recognition (CVPR), 2024
Vishwesh Nath
Wenqi Li
Dong Yang
Andriy Myronenko
Mingxin Zheng
...
Holger Roth
Daguang Xu
Baris Turkbey
Holger Roth
Daguang Xu
VLM
475
22
0
19 Nov 2024
VidComposition: Can MLLMs Analyze Compositions in Compiled Videos?
Computer Vision and Pattern Recognition (CVPR), 2024
Yunlong Tang
Junjia Guo
Hang Hua
Susan Liang
Mingqian Feng
...
Chao Huang
Jing Bi
Zeliang Zhang
Pooyan Fazli
Chenliang Xu
CoGe
360
16
0
17 Nov 2024
Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level
Computer Vision and Pattern Recognition (CVPR), 2024
Andong Deng
Tongjia Chen
Shoubin Yu
Taojiannan Yang
Lincoln Spencer
Yapeng Tian
Lin Wang
Joey Tianyi Zhou
Chen Chen
LRM
327
9
0
15 Nov 2024
StoryTeller: Improving Long Video Description through Global Audio-Visual Character Identification
Yichen He
Yuan Lin
Jianchao Wu
Hanchong Zhang
Yuchen Zhang
Ruicheng Le
VGen
VLM
698
5
0
11 Nov 2024
Both Text and Images Leaked! A Systematic Analysis of Data Contamination in Multimodal LLM
D. Song
Sicheng Lai
Shunian Chen
Shunian Chen
Lichao Sun
Benyou Wang
959
2
0
06 Nov 2024
Situational Scene Graph for Structured Human-centric Situation Understanding
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2024
Chinthani Sugandhika
Chen Li
Deepu Rajan
Basura Fernando
953
3
0
30 Oct 2024
Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving
Bo Jiang
Shaoyu Chen
Bencheng Liao
Xingyu Zhang
Wei Yin
Qian Zhang
Chang Huang
Wen Liu
Xinyu Wang
VLM
MLLM
LRM
221
68
0
29 Oct 2024
AAAR-1.0: Assessing AI's Potential to Assist Research
Renze Lou
Hanzi Xu
Sijia Wang
Jiangshu Du
Ryo Kamoi
...
Xi Li
Jianchao Tan
Congying Xia
Lifu Huang
Wenpeng Yin
380
12
0
29 Oct 2024
COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training
International Conference on Learning Representations (ICLR), 2024
Haocheng Xi
Han Cai
Ligeng Zhu
Yaojie Lu
Kurt Keutzer
Jianfei Chen
Song Han
MQ
373
16
0
25 Oct 2024
Improving Multimodal Large Language Models Using Continual Learning
Shikhar Srivastava
Md Yousuf Harun
Robik Shrestha
Christopher Kanan
KELM
VLM
CLL
179
1
0
25 Oct 2024
VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks
International Conference on Learning Representations (ICLR), 2024
Lawrence Jang
Yinheng Li
Charles Ding
Justin Lin
Paul Pu Liang
Dan Zhao
Rogerio Bonatti
K. Koishida
336
22
0
24 Oct 2024
VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Yifan Peng
Krishna Puvvada
Zhehuai Chen
Piotr .Zelasko
He Huang
Kunal Dhawan
Ke Hu
Shinji Watanabe
Jagadeesh Balam
Boris Ginsburg
313
7
0
23 Oct 2024
E3D-GPT: Enhanced 3D Visual Foundation for Medical Vision-Language Model
Zihang Jiang
Zihang Jiang
Qingsong Yao
Rongsheng Wang
Zhiyang He
Xiaodong Tao
Weifu Lv
Weifu Lv
Shuoling Zhou
VLM
MedIm
132
10
0
18 Oct 2024
ViCToR: Improving Visual Comprehension via Token Reconstruction for Pretraining LMMs
Yin Xie
Kaicheng Yang
Ninghua Yang
Weimo Deng
Xiangzi Dai
Tiancheng Gu
Yumeng Wang
Xiang An
Yongle Zhao
Ziyong Feng
MLLM
VLM
312
1
0
18 Oct 2024
DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control
Yujie Wei
Shiwei Zhang
Hangjie Yuan
Xiang Wang
Haonan Qiu
...
Fan Liu
Zhizhong Huang
Jiaxin Ye
Yingya Zhang
Hongming Shan
DiffM
VGen
287
29
0
17 Oct 2024
MM-Ego: Towards Building Egocentric Multimodal LLMs for Video QA
Hanrong Ye
Haotian Zhang
Erik Daxberger
Lin Chen
Zongyu Lin
...
Haoxuan You
Dan Xu
Zhe Gan
Jiasen Lu
Yinfei Yang
EgoV
MLLM
243
18
0
09 Oct 2024
ActPlan-1K: Benchmarking the Procedural Planning Ability of Visual Language Models in Household Activities
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Ying Su
Zhan Ling
Haochen Shi
Jiayang Cheng
Yauwai Yim
Yangqiu Song
LM&Ro
130
8
0
04 Oct 2024
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
International Conference on Learning Representations (ICLR), 2024
Wenhao Chai
Enxin Song
Y. Du
Chenlin Meng
Vashisht Madhavan
Omer Bar-Tal
Jeng-Neng Hwang
Saining Xie
Christopher D. Manning
3DV
573
88
0
04 Oct 2024
Frame-Voyager: Learning to Query Frames for Video Large Language Models
International Conference on Learning Representations (ICLR), 2024
Sicheng Yu
Chengkai Jin
Huanyu Wang
Zhenghao Chen
Sheng Jin
...
Zhenbang Sun
Bingni Zhang
Jiawei Wu
Hao Zhang
Qianru Sun
295
37
0
04 Oct 2024
LLaVA-Video: Video Instruction Tuning With Synthetic Data
Yuanhan Zhang
Jinming Wu
W. Li
Bo Li
Zejun Ma
Ziwei Liu
Chunyuan Li
SyDa
VGen
428
363
0
03 Oct 2024
LLaVA-Critic: Learning to Evaluate Multimodal Models
Computer Vision and Pattern Recognition (CVPR), 2024
Tianyi Xiong
Xinze Wang
Dong Guo
Qinghao Ye
Haoqi Fan
Quanquan Gu
Heng Huang
Chunyuan Li
MLLM
VLM
LRM
302
91
0
03 Oct 2024
LMOD: A Large Multimodal Ophthalmology Dataset and Benchmark for Large Vision-Language Models
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Zhenyue Qin
Yu Yin
Dylan Campbell
Xuansheng Wu
Ke Zou
Yih-Chung Tham
Ninghao Liu
Xiuzhen Zhang
Qingyu Chen
257
7
0
02 Oct 2024
Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks
Mengzhao Jia
Wenhao Yu
Kaixin Ma
Tianqing Fang
Z. Zhang
Siru Ouyang
Hongming Zhang
Meng Jiang
Dong Yu
VLM
265
11
0
02 Oct 2024
LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness
Chenming Zhu
Tai Wang
Wenwei Zhang
Jiangmiao Pang
Xihui Liu
593
114
0
26 Sep 2024
Can Vision Language Models Learn from Visual Demonstrations of Ambiguous Spatial Reasoning?
Bowen Zhao
Leo Parker Dirac
Paulina Varshavskaya
VLM
LRM
210
1
0
25 Sep 2024
EventHallusion: Diagnosing Event Hallucinations in Video LLMs
Jiacheng Zhang
Yang Jiao
Shaoxiang Chen
Na Zhao
Zhiyu Tan
Hao Li
Yue Yu
MLLM
496
38
0
25 Sep 2024
Multi-modal Generative AI: Multi-modal LLMs, Diffusions, and the Unification
X. Wang
Yuwei Zhou
Bin Huang
Hong Chen
Wenwu Zhu
DiffM
406
1
0
23 Sep 2024
JourneyBench: A Challenging One-Stop Vision-Language Understanding Benchmark of Generated Images
Neural Information Processing Systems (NeurIPS), 2024
Zhecan Wang
Junzhang Liu
Chia-Wei Tang
Hani Alomari
Anushka Sivakumar
...
Haoxuan You
A. Ishmam
Kai-Wei Chang
Shih-Fu Chang
Chris Thomas
CoGe
VLM
443
5
0
19 Sep 2024
Large Language Models are Strong Audio-Visual Speech Recognition Learners
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024
Umberto Cappellazzo
Minsu Kim
Honglie Chen
Pingchuan Ma
Stavros Petridis
Daniele Falavigna
Alessio Brutti
Maja Pantic
332
30
0
18 Sep 2024
NVLM: Open Frontier-Class Multimodal LLMs
Wenliang Dai
Nayeon Lee
Wei Ping
Zhuoling Yang
Zihan Liu
Jon Barker
Tuomas Rintamaki
Mohammad Shoeybi
Bryan Catanzaro
Ming-Yu Liu
MLLM
VLM
LRM
281
111
0
17 Sep 2024
Have Large Vision-Language Models Mastered Art History?
Ombretta Strafforello
Derya Soydaner
Michiel Willems
Anne-Sofie Maerten
Stefanie De Winter
CoGe
VLM
MLLM
180
2
0
05 Sep 2024
UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios
AAAI Conference on Artificial Intelligence (AAAI), 2024
Baichuan Zhou
Haote Yang
Dairong Chen
Junyan Ye
Tianyi Bai
Jinhua Yu
Songyang Zhang
Dahua Lin
Conghui He
Weijia Li
VLM
278
24
0
30 Aug 2024
Law of Vision Representation in MLLMs
Shijia Yang
Bohan Zhai
Quanzeng You
Jianbo Yuan
Hongxia Yang
Chenfeng Xu
485
15
0
29 Aug 2024
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
Min Shi
Fuxiao Liu
Shihao Wang
Shijia Liao
Subhashree Radhakrishnan
...
Andrew Tao
Andrew Tao
Zhiding Yu
Guilin Liu
Guilin Liu
MLLM
358
110
0
28 Aug 2024
LMM-VQA: Advancing Video Quality Assessment with Large Multimodal Models
Qihang Ge
Wei Sun
Yu Zhang
Yunhao Li
Zhongpeng Ji
Fengyu Sun
Shangling Jui
Xiongkuo Min
Guangtao Zhai
179
22
0
26 Aug 2024
RoVRM: A Robust Visual Reward Model Optimized via Auxiliary Textual Preference Data
Chenglong Wang
Yang Gan
Yifu Huo
Yongyu Mu
Murun Yang
...
Chunliang Zhang
Tongran Liu
Quan Du
Di Yang
Jingbo Zhu
VLM
342
11
0
22 Aug 2024
CaRDiff: Video Salient Object Ranking Chain of Thought Reasoning for Saliency Prediction with Diffusion
AAAI Conference on Artificial Intelligence (AAAI), 2024
Yunlong Tang
Gen Zhan
Li Yang
Yiting Liao
Chenliang Xu
VGen
DiffM
LRM
328
13
0
21 Aug 2024
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
Le Xue
Manli Shu
Anas Awadalla
Jun Wang
An Yan
...
Zeyuan Chen
Silvio Savarese
Juan Carlos Niebles
Caiming Xiong
Ran Xu
VLM
429
139
0
16 Aug 2024
LLaVA-OneVision: Easy Visual Task Transfer
Bo Li
Yuanhan Zhang
Dong Guo
Renrui Zhang
Feng Li
Hao Zhang
Kaichen Zhang
Yanwei Li
Ziwei Liu
Chunyuan Li
MLLM
SyDa
VLM
463
1,673
0
06 Aug 2024
VidGen-1M: A Large-Scale Dataset for Text-to-video Generation
Zhiyu Tan
Xiaomeng Yang
Luozheng Qin
Hao Li
VGen
244
37
0
05 Aug 2024
Generalized Out-of-Distribution Detection and Beyond in Vision Language Model Era: A Survey
Atsuyuki Miyai
Jingkang Yang
Jingyang Zhang
Yifei Ming
Sisir Dhakal
...
Yixuan Li
Hai "Helen" Li
Ziwei Liu
Toshihiko Yamasaki
Kiyoharu Aizawa
323
28
0
31 Jul 2024
Effectively Leveraging CLIP for Generating Situational Summaries of Images and Videos
International Journal of Computer Vision (IJCV), 2024
Dhruv Verma
Debaditya Roy
Basura Fernando
246
3
0
30 Jul 2024
VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models
Haodong Duan
Xinyu Fang
Junming Yang
Xiangyu Zhao
Lin Chen
...
Yuhang Zang
Pan Zhang
Jiaqi Wang
Dahua Lin
Kai Chen
LM&MA
VLM
672
339
0
16 Jul 2024
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
Feng Li
Renrui Zhang
Hao Zhang
Yuanhan Zhang
Bo Li
Wei Li
Zejun Ma
Chunyuan Li
MLLM
VLM
315
415
0
10 Jul 2024
OmChat: A Recipe to Train Multimodal Language Models with Strong Long Context and Video Understanding
Tiancheng Zhao
Qianqian Zhang
Kyusong Lee
Peng Liu
Lu Zhang
Chunxin Fang
Jiajia Liao
Kelei Jiang
Yibo Ma
Ruochen Xu
MLLM
VLM
223
8
0
06 Jul 2024
MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs
Yusu Qian
Hanrong Ye
J. Fauconnier
Peter Grasch
Yinfei Yang
Zhe Gan
550
38
0
01 Jul 2024
Previous
1
2
3
4
5
6
Next