Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2304.08485
Cited By
Visual Instruction Tuning
17 April 2023
Haotian Liu
Chunyuan Li
Qingyang Wu
Yong Jae Lee
SyDa
VLM
MLLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Visual Instruction Tuning"
50 / 2,159 papers shown
Title
Instruction-augmented Multimodal Alignment for Image-Text and Element Matching
Xinli Yue
Jianhui Sun
Junda Lu
Liangchao Yao
Fan Xia
Tianyi Wang
Fengyun Rao
Jing Lyu
Yuetang Deng
21
0
0
16 Apr 2025
Aligning Generative Denoising with Discriminative Objectives Unleashes Diffusion for Visual Perception
Ziqi Pang
Xin Xu
Yu-Xiong Wang
DiffM
60
0
0
15 Apr 2025
PVUW 2025 Challenge Report: Advances in Pixel-level Understanding of Complex Videos in the Wild
Henghui Ding
Chang Liu
Nikhila Ravi
Shuting He
Y. Wei
...
Haobo Yuan
X. Li
Tao Zhang
Lu Qi
Ming Yang
25
0
0
15 Apr 2025
GaSLight: Gaussian Splats for Spatially-Varying Lighting in HDR
Christophe Bolduc
Yannick Hold-Geoffroy
Zhixin Shu
Jean-François Lalonde
3DGS
33
0
0
15 Apr 2025
Video Summarization with Large Language Models
Min Jung Lee
Dayoung Gong
Minsu Cho
26
0
0
15 Apr 2025
DeepMLF: Multimodal language model with learnable tokens for deep fusion in sentiment analysis
Efthymios Georgiou
V. Katsouros
Yannis Avrithis
Alexandros Potamianos
24
1
0
15 Apr 2025
SilVar-Med: A Speech-Driven Visual Language Model for Explainable Abnormality Detection in Medical Imaging
Tan-Hanh Pham
Chris Ngo
Trong-Duong Bui
Minh Luu Quang
Tan-Huong Pham
Truong Son-Hy
27
0
0
14 Apr 2025
VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents
Ryota Tanaka
Taichi Iki
Taku Hasegawa
Kyosuke Nishida
Kuniko Saito
Jun Suzuki
VLM
47
0
0
14 Apr 2025
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Jinguo Zhu
Weiyun Wang
Zhe Chen
Z. Liu
Shenglong Ye
...
D. Lin
Yu Qiao
Jifeng Dai
Wenhai Wang
W. Wang
MLLM
VLM
66
7
1
14 Apr 2025
MIEB: Massive Image Embedding Benchmark
Chenghao Xiao
Isaac Chung
Imene Kerboua
Jamie Stirling
Xin Zhang
Márton Kardos
Roman Solomatin
Noura Al Moubayed
K. Enevoldsen
Niklas Muennighoff
VLM
35
0
0
14 Apr 2025
Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding
Tao Zhang
X. Li
Zilong Huang
Y. Li
Weixian Lei
XueQing Deng
Shihao Chen
S. Ji
Jiashi Feng
MLLM
LRM
56
1
0
14 Apr 2025
ReasonDrive: Efficient Visual Question Answering for Autonomous Vehicles with Reasoning-Enhanced Small Vision-Language Models
Amirhosein Chahe
Lifeng Zhou
LRM
33
0
0
14 Apr 2025
AgMMU: A Comprehensive Agricultural Multimodal Understanding and Reasoning Benchmark
Aruna Gauba
Irene Pi
Yunze Man
Ziqi Pang
Vikram S. Adve
Yu-Xiong Wang
80
0
0
14 Apr 2025
Mavors: Multi-granularity Video Representation for Multimodal Large Language Model
Yang Shi
Jiaheng Liu
Yushuo Guan
Z. Wu
Y. Zhang
...
Bohan Zeng
W. Zhang
Fuzheng Zhang
Wenjing Yang
Di Zhang
VGen
VLM
69
0
0
14 Apr 2025
LangPert: Detecting and Handling Task-level Perturbations for Robust Object Rearrangement
Xu Yin
Min-Sung Yoon
Yuchi Huo
Kang Zhang
Sung-eui Yoon
26
0
0
14 Apr 2025
The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer
Weixian Lei
Jiacong Wang
Haochen Wang
X. Li
Jun Hao Liew
Jiashi Feng
Zilong Huang
26
1
0
14 Apr 2025
Socratic Chart: Cooperating Multiple Agents for Robust SVG Chart Understanding
Yuyang Ji
Haohan Wang
LRM
31
0
0
14 Apr 2025
Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models?
Yanbo Wang
Jiyang Guan
Jian Liang
Ran He
43
0
0
14 Apr 2025
FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding
Zheng Liu
Mengjie Liu
J. Chen
Jingwei Xu
Bin Cui
Conghui He
Wentao Zhang
MLLM
57
0
0
14 Apr 2025
Improving Multimodal Hateful Meme Detection Exploiting LMM-Generated Knowledge
Maria Tzelepi
Vasileios Mezaris
34
0
0
14 Apr 2025
The Mirage of Performance Gains: Why Contrastive Decoding Fails to Address Multimodal Hallucination
Hao Yin
Gunagzong Si
Zilei Wang
92
0
0
14 Apr 2025
GenEDA: Unleashing Generative Reasoning on Netlist via Multimodal Encoder-Decoder Aligned Foundation Model
Wenji Fang
Jing Wang
Yao Lu
Shang Liu
Zhiyao Xie
AI4CE
34
1
0
13 Apr 2025
AirVista-II: An Agentic System for Embodied UAVs Toward Dynamic Scene Semantic Understanding
Fei Lin
Yonglin Tian
Tengchao Zhang
Jun Huang
Sangtian Guan
Fei-Yue Wang
35
2
0
13 Apr 2025
SegEarth-R1: Geospatial Pixel Reasoning via Large Language Model
Kaiyu Li
Zepeng Xin
Li Pang
Chao Pang
Yupeng Deng
Jing Yao
Guisong Xia
Deyu Meng
Zhi Wang
Xiangyong Cao
VLM
LRM
37
0
0
13 Apr 2025
BabyVLM: Data-Efficient Pretraining of VLMs Inspired by Infant Learning
Shengao Wang
Arjun Chandra
Aoming Liu
Venkatesh Saligrama
Boqing Gong
MLLM
VLM
45
0
0
13 Apr 2025
Efficient and Asymptotically Unbiased Constrained Decoding for Large Language Models
Haotian Ye
Himanshu Jain
Chong You
A. Suresh
Haowei Lin
James Zou
Felix X. Yu
31
0
0
12 Apr 2025
SDIGLM: Leveraging Large Language Models and Multi-Modal Chain of Thought for Structural Damage Identification
Y. Zhang
Shiyin Wei
Yong Huang
Yawu Su
Shanshan Lu
Hui Li
AI4CE
24
0
0
12 Apr 2025
Evolved Hierarchical Masking for Self-Supervised Learning
Zhanzhou Feng
Shiliang Zhang
37
0
0
12 Apr 2025
PathVLM-R1: A Reinforcement Learning-Driven Reasoning Model for Pathology Visual-Language Tasks
J. Wu
Hao Yang
Xinhua Zeng
Guibing He
Z. Chen
Z. Li
X. Zhang
Yangyang Ma
Run Fang
Yang Liu
LRM
88
0
0
12 Apr 2025
FindAnything: Open-Vocabulary and Object-Centric Mapping for Robot Exploration in Any Environment
Sebastián Barbas Laina
Simon Boche
Sotiris Papatheodorou
Simon Schaefer
Jaehyung Jung
Stefan Leutenegger
44
0
0
11 Apr 2025
F
3
^3
3
Set: Towards Analyzing Fast, Frequent, and Fine-grained Events from Videos
Zhaoyu Liu
Kan Jiang
Murong Ma
Zhé Hóu
Yun Lin
J. Dong
35
0
0
11 Apr 2025
AstroLLaVA: towards the unification of astronomical data and natural language
Sharaf Zaman
Michael J. Smith
P. Khetarpal
Rishabh Chakrabarty
Michele Ginolfi
...
Maja Jabłońska
Sandor Kruk
Matthieu Le Lain
Sergio J. Rodríguez Méndez
Dimitrios Tanoglidis
31
0
0
11 Apr 2025
Ego4o: Egocentric Human Motion Capture and Understanding from Multi-Modal Input
Jian Wang
Rishabh Dabral
D. Luvizon
Zhe Cao
Lingjie Liu
Thabo Beeler
Christian Theobalt
EgoV
45
0
0
11 Apr 2025
Spatial Audio Processing with Large Language Model on Wearable Devices
Ayushi Mishra
Yang Bai
Priyadarshan Narayanasamy
Nakul Garg
Nirupam Roy
30
0
0
11 Apr 2025
VLMT: Vision-Language Multimodal Transformer for Multimodal Multi-hop Question Answering
Qi Zhi Lim
C. Lee
K. Lim
Kalaiarasi Sonai Muthu Anbananthen
31
0
0
11 Apr 2025
FocalLens: Instruction Tuning Enables Zero-Shot Conditional Image Representations
Cheng-Yu Hsieh
Pavan Kumar Anasosalu Vasu
Fartash Faghri
Raviteja Vemulapalli
Chun-Liang Li
Ranjay Krishna
Oncel Tuzel
Hadi Pouransari
VLM
99
0
0
11 Apr 2025
Steering CLIP's vision transformer with sparse autoencoders
Sonia Joseph
Praneet Suresh
Ethan Goldfarb
Lorenz Hufe
Yossi Gandelsman
Robert Graham
Danilo Bzdok
Wojciech Samek
Blake A. Richards
51
2
0
11 Apr 2025
PACT: Pruning and Clustering-Based Token Reduction for Faster Visual Language Models
M. Dhouib
Davide Buscaldi
Sonia Vanier
A. Shabou
VLM
36
0
0
11 Apr 2025
Mimic In-Context Learning for Multimodal Tasks
Yuchu Jiang
Jiale Fu
Chenduo Hao
Xinting Hu
Yingzhe Peng
Xin Geng
Xu Yang
27
0
0
11 Apr 2025
Why We Feel: Breaking Boundaries in Emotional Reasoning with Multimodal Large Language Models
Yuxiang Lin
Jingdong Sun
Zhi-Qi Cheng
Jue Wang
Haomin Liang
Zebang Cheng
Yifei Dong
Jun-Yan He
Xiaojiang Peng
Xian-Sheng Hua
41
0
0
10 Apr 2025
Memory-efficient Streaming VideoLLMs for Real-time Procedural Video Understanding
Dibyadip Chatterjee
Edoardo Remelli
Yale Song
Bugra Tekin
Abhay Mittal
...
Shreyas Hampali
Eric Sauser
Shugao Ma
Angela Yao
Fadime Sener
VLM
35
0
0
10 Apr 2025
SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement
X. Wang
Z. Yang
Chao Feng
Hongjin Lu
Linjie Li
Chung-Ching Lin
Kevin Qinghong Lin
Furong Huang
Lijuan Wang
OODD
ReLM
VLM
LRM
69
1
0
10 Apr 2025
MM-IFEngine: Towards Multimodal Instruction Following
Shengyuan Ding
Shenxi Wu
Xiangyu Zhao
Yuhang Zang
Haodong Duan
Xiaoyi Dong
Pan Zhang
Y. Cao
D. Lin
Jiaqi Wang
OffRL
56
1
0
10 Apr 2025
Capybara-OMNI: An Efficient Paradigm for Building Omni-Modal Language Models
Xingguang Ji
Jiakang Wang
Hongzhi Zhang
Jingyuan Zhang
Haonan Zhou
Chenxi Sun
Y. Liu
Qi Wang
Fuzheng Zhang
MLLM
VLM
58
0
0
10 Apr 2025
VideoExpert: Augmented LLM for Temporal-Sensitive Video Understanding
Henghao Zhao
Ge-Peng Ji
Rui Yan
Huan Xiong
Zechao Li
21
0
0
10 Apr 2025
TokenFocus-VQA: Enhancing Text-to-Image Alignment with Position-Aware Focus and Multi-Perspective Aggregations on LVLMs
Zijian Zhang
Xuhui Zheng
X. Wu
Chong Peng
Xuezhi Cao
30
0
0
10 Apr 2025
VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning
Haozhe Wang
C. Qu
Zuming Huang
Wei Chu
Fangzhen Lin
Wenhu Chen
OffRL
ReLM
SyDa
LRM
VLM
72
1
0
10 Apr 2025
SF2T: Self-supervised Fragment Finetuning of Video-LLMs for Fine-Grained Understanding
Yangliu Hu
Zikai Song
Na Feng
Yawei Luo
Junqing Yu
Yi-Ping Phoebe Chen
Wei Yang
33
0
0
10 Apr 2025
Perception-R1: Pioneering Perception Policy with Reinforcement Learning
En Yu
Kangheng Lin
Liang Zhao
Jisheng Yin
Yana Wei
...
Zheng Ge
Xiangyu Zhang
Daxin Jiang
Jingyu Wang
Wenbing Tao
VLM
OffRL
LRM
35
0
0
10 Apr 2025
Marmot: Multi-Agent Reasoning for Multi-Object Self-Correcting in Improving Image-Text Alignment
Jiayang Sun
H. Wang
Jie Cao
Huaibo Huang
R. He
DiffM
71
0
0
10 Apr 2025
Previous
1
2
3
4
5
...
42
43
44
Next