2402.11684
ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models
18 February 2024
Guiming Hardy Chen
Shunian Chen
Ruifei Zhang
Junying Chen
Xiangbo Wu
Zhiyi Zhang
Zhihong Chen
Jianquan Li
Xiang Wan
Benyou Wang
VLM
SyDa
HuggingFace (2 upvotes)
Github (281★)
Papers citing "ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models"
50 / 82 papers shown
Text-Printed Image: Bridging the Image-Text Modality Gap for Text-centric Training of Large Vision-Language Models
Shojiro Yamabe
Futa Waseda
Daiki Shiono
Tsubasa Takahashi
DiffM
MLLM
VLM
293
1
0
03 Dec 2025
LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight
Yunze Man
S. S. Wang
Guowen Zhang
Johan Bjorck
Zhiqi Li
Liang-Yan Gui
Jim Fan
Jan Kautz
Yu Wang
Zhiding Yu
173
1
0
25 Nov 2025
Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models
Mark Endo
Serena Yeung-Levy
LRM
284
1
0
21 Nov 2025
NVIDIA Nemotron Nano V2 VL
Nvidia
Amala Sanjay Deshmukh
Kateryna Chumachenko
Tuomas Rintamaki
Matthieu Le
...
Krzysztof Pawelec
Michael Evans
Katherine Luna
Jie Lou
Erick Galinkin
VLM
397
5
0
06 Nov 2025
FineVision: Open Data Is All You Need
Luis Wiedmann
Orr Zohar
Amir Mahla
Xiaohan Wang
Rui Li
Thibaud Frere
Leandro von Werra
Aritra Roy Gosthipaty
Andrés Marafioti
VLM
231
18
0
20 Oct 2025
From Pixels to Words -- Towards Native Vision-Language Primitives at Scale
Haiwen Diao
Mingxuan Li
Silei Wu
Linjun Dai
Xiaohua Wang
Hanming Deng
Lewei Lu
Dahua Lin
Ziwei Liu
VLM
213
4
0
16 Oct 2025
Where on Earth? A Vision-Language Benchmark for Probing Model Geolocation Skills Across Scales
Zhaofang Qian
Hardy Chen
Zeyu Wang
Li Zhang
Zijun Wang
...
Xianfeng Tang
Zeyu Zheng
Haoqin Tu
Cihang Xie
Yuyin Zhou
LRM
125
2
0
13 Oct 2025
Dirichlet-Prior Shaping: Guiding Expert Specialization in Upcycled MoEs
Leyla Mirvakhabova
B. Bejnordi
Gaurav Kumar
Hanxue Liang
Wanru Zhao
Paul N. Whatmough
MoE
118
1
0
01 Oct 2025
CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning
Long Xing
Xiaoyi Dong
Yuhang Zang
Yuhang Cao
Jianze Liang
Qidong Huang
Jiaqi Wang
Feng Wu
Dahua Lin
OffRL
VLM
209
12
0
26 Sep 2025
MindVL: Towards Efficient and Effective Training of Multimodal Large Language Models on Ascend NPUs
Feilong Chen
Y. Liu
Yi Huang
Hao Wang
Miren Tian
Ya-Qi Yu
Minghui Liao
Jihao Wu
MLLM
VLM
445
2
0
15 Sep 2025
OneCAT: Decoder-Only Auto-Regressive Model for Unified Understanding and Generation
Han Li
Xinyu Peng
Y. Wang
Zelin Peng
Xin Chen
Rongxiang Weng
Jingang Wang
Xunliang Cai
Wenrui Dai
Hongkai Xiong
MLLM
OffRL
412
25
0
03 Sep 2025
UItron: Foundational GUI Agent with Advanced Perception and Planning
Zhixiong Zeng
Jing Huang
Liming Zheng
Wenkang Han
Yufeng Zhong
Lei Chen
Longrong Yang
Yingjie Chu
Yuzhi He
Lin Ma
LLMAG
225
12
0
29 Aug 2025
UniUGG: Unified 3D Understanding and Generation via Geometric-Semantic Encoding
Yueming Xu
Jiahui Zhang
Ze Huang
Yurui Chen
Yanpeng Zhou
...
Zhongang Qi
Xingyue Quan
Jianye Hao
Hang Xu
Li Zhang
281
4
0
16 Aug 2025
MoIIE: Mixture of Intra- and Inter-Modality Experts for Large Vision Language Models
Dianyi Wang
Siyuan Wang
Zejun Li
Yikun Wang
Yitong Li
Duyu Tang
Xiaoyu Shen
Xuanjing Huang
Zhongyu Wei
MoE
243
1
0
13 Aug 2025
NEP: Autoregressive Image Editing via Next Editing Token Prediction
Huimin Wu
Xiaojian Ma
Haozhe Zhao
Yanpeng Zhao
Qing Li
DiffM
185
4
0
08 Aug 2025
MAGE: Multimodal Alignment and Generation Enhancement via Bridging Visual and Semantic Spaces
International Joint Conference on Artificial Intelligence (IJCAI), 2025
Shaojun E
Yuchen Yang
Jiaheng Wu
Yan Zhang
Tiejun Zhao
Ziyan Chen
221
1
0
29 Jul 2025
LMM-Det: Make Large Multimodal Models Excel in Object Detection
Jincheng Li
Chunyu Xie
Ji Ao
Dawei Leng
Yuhui Yin
MLLM
ObjD
VLM
420
9
0
24 Jul 2025
Advancing Multimodal LLMs by Large-Scale 3D Visual Instruction Dataset Generation
Liu He
Xiao Zeng
Yizhi Song
Albert Y. C. Chen
Lu Xia
Shashwat Verma
Sankalp Dayal
Min Sun
Cheng-Hao Kuo
Daniel G. Aliaga
VGen
300
0
0
11 Jul 2025
RationalVLA: A Rational Vision-Language-Action Model with Dual System
Wenxuan Song
Jiayi Chen
Wenxue Li
Xu He
Han Zhao
...
Xinhu Zheng
Yanfeng Guo
Hesheng Wang
Yunhui Liu
Haoang Li
LM&Ro
552
15
0
12 Jun 2025
VisualSphinx: Large-Scale Synthetic Vision Logic Puzzles for RL
Yichen Feng
Zhangchen Xu
Fengqing Jiang
Yuetai Li
Bhaskar Ramasubramanian
Luyao Niu
Bill Yuchen Lin
Radha Poovendran
ReLM
LRM
196
12
0
29 May 2025
Slot-MLLM: Object-Centric Visual Tokenization for Multimodal LLM
Donghwan Chi
Hyomin Kim
Yoonjin Oh
Yongjin Kim
Donghoon Lee
DaeJin Jo
Jongmin Kim
Junyeob Baek
Sungjin Ahn
Sungwoong Kim
MLLM
VLM
1.0K
2
0
23 May 2025
Instructify: Demystifying Metadata to Visual Instruction Tuning Data Conversion
Jacob A. Hansen
Wei Lin
Junmo Kang
M. Jehanzeb Mirza
Hongyin Luo
Rogerio Feris
Alan Ritter
James R. Glass
Leonid Karlinsky
VLM
540
1
0
23 May 2025
Visual Instruction Tuning with Chain of Region-of-Interest
Yixin Chen
Shuai Zhang
Boran Han
Bernie Wang
326
2
0
11 May 2025
Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities
Wei Wei
Jintao Guo
Shanshan Zhao
Minghao Fu
Lunhao Duan
...
Guo-Hua Wang
Qing-Guo Chen
Zhao Xu
Weihua Luo
Kaifu Zhang
DiffM
1.4K
45
0
05 May 2025
Low-hallucination Synthetic Captions for Large-Scale Vision-Language Model Pre-training
Xinsong Zhang
Yarong Zeng
Xinting Huang
Hu Hu
Runquan Xie
Han Hu
Zhanhui Kang
MLLM
VLM
580
5
0
17 Apr 2025
Data Metabolism: An Efficient Data Design Schema For Vision Language Model
Jingyuan Zhang
Hongzhi Zhang
Zhou Haonan
Chenxi Sun
Xingguang Ji
Jiakang Wang
Fanheng Kong
Wenshu Fan
Qi Wang
Fuzheng Zhang
VLM
405
2
0
10 Apr 2025
MM-IFEngine: Towards Multimodal Instruction Following
Shengyuan Ding
Shenxi Wu
Xiangyu Zhao
Yuhang Zang
Haodong Duan
Xiaoyi Dong
Pan Zhang
Yuhang Cao
Dahua Lin
Jiaqi Wang
OffRL
647
29
0
10 Apr 2025
UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding
Yang Jiao
Haibo Qiu
Zequn Jie
Tian Jin
Yue Yu
Lin Ma
Yu Jiang
340
38
0
06 Apr 2025
UniViTAR: Unified Vision Transformer with Native Resolution
Limeng Qiao
Yiyang Gan
Bairui Wang
Jie Qin
Shuang Xu
Siqi Yang
Lin Ma
557
4
0
02 Apr 2025
Skip-Vision: Efficient and Scalable Acceleration of Vision-Language Models via Adaptive Token Skipping
Weili Zeng
Ziyuan Huang
Kaixiang Ji
Manwen Liao
VLM
772
6
0
26 Mar 2025
Instruction-Oriented Preference Alignment for Enhancing Multi-Modal Comprehension Capability of MLLMs
Zitian Wang
Yue Liao
Kang Rong
Fengyun Rao
Yibo Yang
Si Liu
376
1
0
26 Mar 2025
From Head to Tail: Towards Balanced Representation in Large Vision-Language Models through Adaptive Data Calibration
Computer Vision and Pattern Recognition (CVPR), 2025
Mingyang Song
Xiaoye Qu
Jiawei Zhou
Yu Cheng
VLM
616
6
0
17 Mar 2025
HiMTok: Learning Hierarchical Mask Tokens for Image Segmentation with Large Multimodal Model
Tao Wang
Changxu Cheng
Lingfeng Wang
Senda Chen
Wuyue Zhao
VLM
455
9
0
17 Mar 2025
Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers
Weiming Ren
Wentao Ma
Huan Yang
Cong Wei
Ge Zhang
Lei Ma
Mamba
444
27
0
14 Mar 2025
Oasis: One Image is All You Need for Multimodal Instruction Data Synthesis
Letian Zhang
Quan Cui
Bingchen Zhao
Cheng Yang
MLLM
SyDa
545
9
0
11 Mar 2025
Referring to Any Person
Qing Jiang
Lin Wu
Zhaoyang Zeng
Tianhe Ren
Yuda Xiong
Yihao Chen
Qin Liu
Lei Zhang
975
15
0
11 Mar 2025
PP-DocBee: Improving Multimodal Document Understanding Through a Bag of Tricks
Feng Ni
Kui Huang
Yao Lu
Wenyu Lv
Guanzhong Wang
Zeyu Chen
Wenshu Fan
VLM
477
2
0
06 Mar 2025
RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete
Computer Vision and Pattern Recognition (CVPR), 2025
Yuheng Ji
Huajie Tan
Jiayu Shi
Xiaoshuai Hao
Yuan Zhang
...
Huaihai Lyu
Xiaolong Zheng
Jiaming Liu
Zhongyuan Wang
Shanghang Zhang
583
114
0
28 Feb 2025
M2-omni: Advancing Omni-MLLM for Comprehensive Modality Support with Competitive Performance
Qingpei Guo
Kaiyou Song
Zipeng Feng
Ziping Ma
Qinglong Zhang
...
Yunxiao Sun
Tai-Wei Chang
Jingdong Chen
Ming Yang
Jun Zhou
MLLM
VLM
749
14
0
26 Feb 2025
Megrez-Omni Technical Report
Boxun Li
Yadong Li
Hui Yuan
Congyi Liu
Weilin Liu
...
Dong Zhou
Yueqing Zhuang
Shengen Yan
Guohao Dai
Longji Xu
265
2
0
19 Feb 2025
Soundwave: Less is More for Speech-Text Alignment in LLMs
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Yunke Zhang
Zhiheng Liu
Fan Bu
Ruiyu Zhang
Benyou Wang
Haoyang Li
AuLLM
SyDa
VLM
320
8
0
18 Feb 2025
Vision-Language Models for Edge Networks: A Comprehensive Survey
IEEE Internet of Things Journal (IEEE IoT J.), 2025
Ahmed Sharshar
Latif U. Khan
Waseem Ullah
Mohsen Guizani
VLM
404
3
0
11 Feb 2025
InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling
Yi Wang
Xinhao Li
Ziang Yan
Yinan He
Jiashuo Yu
...
Kai Chen
Wenhai Wang
Yu Qiao
Yali Wang
Limin Wang
675
148
0
21 Jan 2025
Social-LLaVA: Enhancing Robot Navigation through Human-Language Reasoning in Social Spaces
Amirreza Payandeh
Daeun Song
Mohammad Nazeri
Jing Liang
Praneel Mukherjee
Amir Hossain Raj
Yangzhe Kong
Dinesh Manocha
Xuesu Xiao
LM&Ro
LRM
483
22
0
17 Jan 2025
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
Neural Information Processing Systems (NeurIPS), 2024
Jiannan Wu
Muyan Zhong
Sen Xing
Zeqiang Lai
Zhaoyang Liu
...
Lewei Lu
Tong Lu
Ping Luo
Yu Qiao
Jifeng Dai
MLLM
VLM
LRM
1.0K
141
0
03 Jan 2025
VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
Xinhao Li
Yi Wang
Jiashuo Yu
Xiangyu Zeng
Yuhan Zhu
...
Yinan He
Chenting Wang
Yu Qiao
Yali Wang
L. Wang
VLM
1.0K
145
0
31 Dec 2024
HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding
Computer Vision and Pattern Recognition (CVPR), 2025
Chenxin Tao
Shiqian Su
X. Zhu
Chenyu Zhang
Zhe Chen
...
Wenhai Wang
Lewei Lu
Gao Huang
Yu Qiao
Jifeng Dai
MLLM
VLM
591
7
0
20 Dec 2024
VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models
Computer Vision and Pattern Recognition (CVPR), 2025
Byung-Kwan Lee
Ryo Hachiuma
Yu-Chiang Frank Wang
Y. Ro
Yueh-Hua Wu
VLM
489
8
0
02 Dec 2024
On Domain-Adaptive Post-Training for Multimodal Large Language Models
Daixuan Cheng
Shaohan Huang
Ziyu Zhu
Xintong Zhang
Wayne Xin Zhao
Zhongzhi Luan
Bo Dai
Zhenliang Zhang
VLM
550
5
0
29 Nov 2024
ChatRex: Taming Multimodal LLM for Joint Perception and Understanding
Qing Jiang
Gen Luo
Yuqin Yang
Yuda Xiong
Yihao Chen
Zhaoyang Zeng
Tianhe Ren
Lei Zhang
VLM
LRM
653
25
0
27 Nov 2024