VILA: On Pre-training for Visual Language Models

Computer Vision and Pattern Recognition (CVPR), 2023
12 December 2023
Ji Lin
Hongxu Yin
Ming-Yu Liu
Yao Lu
Pavlo Molchanov
Andrew Tao
Huizi Mao
Jan Kautz
Mohammad Shoeybi
Song Han
MLLM VLM

Papers citing "VILA: On Pre-training for Visual Language Models"

50 / 275 papers shown
Title
RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics
Computer Vision and Pattern Recognition (CVPR), 2024
Chan Hee Song
Valts Blukis
Jonathan Tremblay
Stephen Tyree
Yu-Chuan Su
Stan Birchfield
748
76
0
25 Nov 2024
Human-Activity AGV Quality Assessment: A Benchmark Dataset and an Objective Evaluation Metric
Zhichao Zhang
Wei Sun
Xinyue Li
Yunhao Li
Qihang Ge
...
Zhongpeng Ji
Fengyu Sun
Shangling Jui
Xiongkuo Min
Guoquan Zheng
EGVM
461
10
0
25 Nov 2024
AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea
Computer Vision and Pattern Recognition (CVPR), 2024
Qifan Yu
Wei Chow
Zhongqi Yue
Kaihang Pan
Yang Wu
Xiaoyang Wan
Juncheng Billy Li
Siliang Tang
Hao Zhang
Yueting Zhuang
DiffM
429
106
0
24 Nov 2024
Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension
Yongdong Luo
Xiawu Zheng
Guilin Li
Haojia Lin
...
Jinfa Huang
Jiayi Ji
Jiebo Luo
Rongrong Ji
VLM
542
67
0
20 Nov 2024
VILA-M3: Enhancing Vision-Language Models with Medical Expert Knowledge
Computer Vision and Pattern Recognition (CVPR), 2024
Vishwesh Nath
Wenqi Li
Dong Yang
Andriy Myronenko
Mingxin Zheng
...
Baris Turkbey
Holger Roth
Daguang Xu
VLM
475
22
0
19 Nov 2024
VidComposition: Can MLLMs Analyze Compositions in Compiled Videos?
Computer Vision and Pattern Recognition (CVPR), 2024
Yunlong Tang
Junjia Guo
Hang Hua
Susan Liang
Mingqian Feng
...
Chao Huang
Jing Bi
Zeliang Zhang
Pooyan Fazli
Chenliang Xu
CoGe
360
16
0
17 Nov 2024
Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level
Computer Vision and Pattern Recognition (CVPR), 2024
Andong Deng
Tongjia Chen
Shoubin Yu
Taojiannan Yang
Lincoln Spencer
Yapeng Tian
Lin Wang
Joey Tianyi Zhou
Chen Chen
LRM
327
9
0
15 Nov 2024
StoryTeller: Improving Long Video Description through Global Audio-Visual Character Identification
Yichen He
Yuan Lin
Jianchao Wu
Hanchong Zhang
Yuchen Zhang
Ruicheng Le
VGen VLM
698
5
0
11 Nov 2024
Both Text and Images Leaked! A Systematic Analysis of Data Contamination in Multimodal LLM
D. Song
Sicheng Lai
Shunian Chen
Lichao Sun
Benyou Wang
959
2
0
06 Nov 2024
Situational Scene Graph for Structured Human-centric Situation Understanding
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2024
Chinthani Sugandhika
Chen Li
Deepu Rajan
Basura Fernando
953
3
0
30 Oct 2024
Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving
Bo Jiang
Shaoyu Chen
Bencheng Liao
Xingyu Zhang
Wei Yin
Qian Zhang
Chang Huang
Wen Liu
Xinyu Wang
VLM MLLM LRM
221
68
0
29 Oct 2024
AAAR-1.0: Assessing AI's Potential to Assist Research
Renze Lou
Hanzi Xu
Sijia Wang
Jiangshu Du
Ryo Kamoi
...
Xi Li
Jianchao Tan
Congying Xia
Lifu Huang
Wenpeng Yin
380
12
0
29 Oct 2024
COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training
International Conference on Learning Representations (ICLR), 2024
Haocheng Xi
Han Cai
Ligeng Zhu
Yaojie Lu
Kurt Keutzer
Jianfei Chen
Song Han
MQ
373
16
0
25 Oct 2024
Improving Multimodal Large Language Models Using Continual Learning
Shikhar Srivastava
Md Yousuf Harun
Robik Shrestha
Christopher Kanan
KELM VLM CLL
179
1
0
25 Oct 2024
VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks
International Conference on Learning Representations (ICLR), 2024
Lawrence Jang
Yinheng Li
Charles Ding
Justin Lin
Paul Pu Liang
Dan Zhao
Rogerio Bonatti
K. Koishida
336
22
0
24 Oct 2024
VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Yifan Peng
Krishna Puvvada
Zhehuai Chen
Piotr Żelasko
He Huang
Kunal Dhawan
Ke Hu
Shinji Watanabe
Jagadeesh Balam
Boris Ginsburg
313
7
0
23 Oct 2024
E3D-GPT: Enhanced 3D Visual Foundation for Medical Vision-Language Model
Zihang Jiang
Qingsong Yao
Rongsheng Wang
Zhiyang He
Xiaodong Tao
Weifu Lv
Shuoling Zhou
VLM MedIm
132
10
0
18 Oct 2024
ViCToR: Improving Visual Comprehension via Token Reconstruction for Pretraining LMMs
Yin Xie
Kaicheng Yang
Ninghua Yang
Weimo Deng
Xiangzi Dai
Tiancheng Gu
Yumeng Wang
Xiang An
Yongle Zhao
Ziyong Feng
MLLM VLM
312
1
0
18 Oct 2024
DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control
Yujie Wei
Shiwei Zhang
Hangjie Yuan
Xiang Wang
Haonan Qiu
...
Fan Liu
Zhizhong Huang
Jiaxin Ye
Yingya Zhang
Hongming Shan
DiffM VGen
287
29
0
17 Oct 2024
MM-Ego: Towards Building Egocentric Multimodal LLMs for Video QA
Hanrong Ye
Haotian Zhang
Erik Daxberger
Lin Chen
Zongyu Lin
...
Haoxuan You
Dan Xu
Zhe Gan
Jiasen Lu
Yinfei Yang
EgoV MLLM
243
18
0
09 Oct 2024
ActPlan-1K: Benchmarking the Procedural Planning Ability of Visual Language Models in Household Activities
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Ying Su
Zhan Ling
Haochen Shi
Jiayang Cheng
Yauwai Yim
Yangqiu Song
LM&Ro
130
8
0
04 Oct 2024
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
International Conference on Learning Representations (ICLR), 2024
Wenhao Chai
Enxin Song
Y. Du
Chenlin Meng
Vashisht Madhavan
Omer Bar-Tal
Jeng-Neng Hwang
Saining Xie
Christopher D. Manning
3DV
573
88
0
04 Oct 2024
Frame-Voyager: Learning to Query Frames for Video Large Language Models
International Conference on Learning Representations (ICLR), 2024
Sicheng Yu
Chengkai Jin
Huanyu Wang
Zhenghao Chen
Sheng Jin
...
Zhenbang Sun
Bingni Zhang
Jiawei Wu
Hao Zhang
Qianru Sun
295
37
0
04 Oct 2024
LLaVA-Video: Video Instruction Tuning With Synthetic Data
Yuanhan Zhang
Jinming Wu
W. Li
Bo Li
Zejun Ma
Ziwei Liu
Chunyuan Li
SyDa VGen
428
363
0
03 Oct 2024
LLaVA-Critic: Learning to Evaluate Multimodal Models
Computer Vision and Pattern Recognition (CVPR), 2024
Tianyi Xiong
Xinze Wang
Dong Guo
Qinghao Ye
Haoqi Fan
Quanquan Gu
Heng Huang
Chunyuan Li
MLLM VLM LRM
302
91
0
03 Oct 2024
LMOD: A Large Multimodal Ophthalmology Dataset and Benchmark for Large Vision-Language Models
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Zhenyue Qin
Yu Yin
Dylan Campbell
Xuansheng Wu
Ke Zou
Yih-Chung Tham
Ninghao Liu
Xiuzhen Zhang
Qingyu Chen
257
7
0
02 Oct 2024
Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks
Mengzhao Jia
Wenhao Yu
Kaixin Ma
Tianqing Fang
Z. Zhang
Siru Ouyang
Hongming Zhang
Meng Jiang
Dong Yu
VLM
265
11
0
02 Oct 2024
LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness
Chenming Zhu
Tai Wang
Wenwei Zhang
Jiangmiao Pang
Xihui Liu
593
114
0
26 Sep 2024
Can Vision Language Models Learn from Visual Demonstrations of Ambiguous Spatial Reasoning?
Bowen Zhao
Leo Parker Dirac
Paulina Varshavskaya
VLM LRM
210
1
0
25 Sep 2024
EventHallusion: Diagnosing Event Hallucinations in Video LLMs
Jiacheng Zhang
Yang Jiao
Shaoxiang Chen
Na Zhao
Zhiyu Tan
Hao Li
Yue Yu
MLLM
496
38
0
25 Sep 2024
Multi-modal Generative AI: Multi-modal LLMs, Diffusions, and the Unification
X. Wang
Yuwei Zhou
Bin Huang
Hong Chen
Wenwu Zhu
DiffM
406
1
0
23 Sep 2024
JourneyBench: A Challenging One-Stop Vision-Language Understanding Benchmark of Generated Images
Neural Information Processing Systems (NeurIPS), 2024
Zhecan Wang
Junzhang Liu
Chia-Wei Tang
Hani Alomari
Anushka Sivakumar
...
Haoxuan You
A. Ishmam
Kai-Wei Chang
Shih-Fu Chang
Chris Thomas
CoGe VLM
443
5
0
19 Sep 2024
Large Language Models are Strong Audio-Visual Speech Recognition Learners
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024
Umberto Cappellazzo
Minsu Kim
Honglie Chen
Pingchuan Ma
Stavros Petridis
Daniele Falavigna
Alessio Brutti
Maja Pantic
332
30
0
18 Sep 2024
NVLM: Open Frontier-Class Multimodal LLMs
Wenliang Dai
Nayeon Lee
Wei Ping
Zhuoling Yang
Zihan Liu
Jon Barker
Tuomas Rintamaki
Mohammad Shoeybi
Bryan Catanzaro
Ming-Yu Liu
MLLM VLM LRM
281
111
0
17 Sep 2024
Have Large Vision-Language Models Mastered Art History?
Ombretta Strafforello
Derya Soydaner
Michiel Willems
Anne-Sofie Maerten
Stefanie De Winter
CoGe VLM MLLM
180
2
0
05 Sep 2024
UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios
AAAI Conference on Artificial Intelligence (AAAI), 2024
Baichuan Zhou
Haote Yang
Dairong Chen
Junyan Ye
Tianyi Bai
Jinhua Yu
Songyang Zhang
Dahua Lin
Conghui He
Weijia Li
VLM
278
24
0
30 Aug 2024
Law of Vision Representation in MLLMs
Shijia Yang
Bohan Zhai
Quanzeng You
Jianbo Yuan
Hongxia Yang
Chenfeng Xu
485
15
0
29 Aug 2024
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
Min Shi
Fuxiao Liu
Shihao Wang
Shijia Liao
Subhashree Radhakrishnan
...
Andrew Tao
Zhiding Yu
Guilin Liu
MLLM
358
110
0
28 Aug 2024
LMM-VQA: Advancing Video Quality Assessment with Large Multimodal Models
Qihang Ge
Wei Sun
Yu Zhang
Yunhao Li
Zhongpeng Ji
Fengyu Sun
Shangling Jui
Xiongkuo Min
Guangtao Zhai
179
22
0
26 Aug 2024
RoVRM: A Robust Visual Reward Model Optimized via Auxiliary Textual Preference Data
Chenglong Wang
Yang Gan
Yifu Huo
Yongyu Mu
Murun Yang
...
Chunliang Zhang
Tongran Liu
Quan Du
Di Yang
Jingbo Zhu
VLM
342
11
0
22 Aug 2024
CaRDiff: Video Salient Object Ranking Chain of Thought Reasoning for Saliency Prediction with Diffusion
AAAI Conference on Artificial Intelligence (AAAI), 2024
Yunlong Tang
Gen Zhan
Li Yang
Yiting Liao
Chenliang Xu
VGen DiffM LRM
328
13
0
21 Aug 2024
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
Le Xue
Manli Shu
Anas Awadalla
Jun Wang
An Yan
...
Zeyuan Chen
Silvio Savarese
Juan Carlos Niebles
Caiming Xiong
Ran Xu
VLM
429
139
0
16 Aug 2024
LLaVA-OneVision: Easy Visual Task Transfer
Bo Li
Yuanhan Zhang
Dong Guo
Renrui Zhang
Feng Li
Hao Zhang
Kaichen Zhang
Yanwei Li
Ziwei Liu
Chunyuan Li
MLLM SyDa VLM
463
1,673
0
06 Aug 2024
VidGen-1M: A Large-Scale Dataset for Text-to-video Generation
Zhiyu Tan
Xiaomeng Yang
Luozheng Qin
Hao Li
VGen
244
37
0
05 Aug 2024
Generalized Out-of-Distribution Detection and Beyond in Vision Language Model Era: A Survey
Atsuyuki Miyai
Jingkang Yang
Jingyang Zhang
Yifei Ming
Sisir Dhakal
...
Yixuan Li
Hai "Helen" Li
Ziwei Liu
Toshihiko Yamasaki
Kiyoharu Aizawa
323
28
0
31 Jul 2024
Effectively Leveraging CLIP for Generating Situational Summaries of Images and Videos
International Journal of Computer Vision (IJCV), 2024
Dhruv Verma
Debaditya Roy
Basura Fernando
246
3
0
30 Jul 2024
VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models
Haodong Duan
Xinyu Fang
Junming Yang
Xiangyu Zhao
Lin Chen
...
Yuhang Zang
Pan Zhang
Jiaqi Wang
Dahua Lin
Kai Chen
LM&MA VLM
672
339
0
16 Jul 2024
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
Feng Li
Renrui Zhang
Hao Zhang
Yuanhan Zhang
Bo Li
Wei Li
Zejun Ma
Chunyuan Li
MLLM VLM
315
415
0
10 Jul 2024
OmChat: A Recipe to Train Multimodal Language Models with Strong Long Context and Video Understanding
Tiancheng Zhao
Qianqian Zhang
Kyusong Lee
Peng Liu
Lu Zhang
Chunxin Fang
Jiajia Liao
Kelei Jiang
Yibo Ma
Ruochen Xu
MLLM VLM
223
8
0
06 Jul 2024
MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs
Yusu Qian
Hanrong Ye
J. Fauconnier
Peter Grasch
Yinfei Yang
Zhe Gan
550
38
0
01 Jul 2024