Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2304.08485
Cited By
Visual Instruction Tuning
17 April 2023
Haotian Liu
Chunyuan Li
Qingyang Wu
Yong Jae Lee
SyDa
VLM
MLLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Visual Instruction Tuning"
50 / 2,160 papers shown
Title
UniViTAR: Unified Vision Transformer with Native Resolution
Limeng Qiao
Yiyang Gan
Bairui Wang
Jie Qin
Shuang Xu
Siqi Yang
Lin Ma
50
0
0
02 Apr 2025
ANNEXE: Unified Analyzing, Answering, and Pixel Grounding for Egocentric Interaction
Yuejiao Su
Yi Wang
Qiongyang Hu
Chuang Yang
Lap-Pui Chau
45
0
0
02 Apr 2025
PiCo: Jailbreaking Multimodal Large Language Models via
Pi
\textbf{Pi}
Pi
ctorial
Co
\textbf{Co}
Co
de Contextualization
Aofan Liu
Lulu Tang
Ting Pan
Yuguo Yin
Bin Wang
Ao Yang
MLLM
AAML
40
0
0
02 Apr 2025
Prompting Medical Vision-Language Models to Mitigate Diagnosis Bias by Generating Realistic Dermoscopic Images
Nusrat Munia
Abdullah-Al-Zubaer Imran
LM&MA
MedIm
32
0
0
02 Apr 2025
Reasoning LLMs for User-Aware Multimodal Conversational Agents
Hamed Rahimi
Jeanne Cattoni
Meriem Beghili
Mouad Abrini
Mahdi Khoramshahi
Maribel Pino
Mohamed Chetouani
LRM
34
0
0
02 Apr 2025
ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement
Runhui Huang
Chunwei Wang
Junwei Yang
Guansong Lu
Yunlong Yuan
...
Lu Hou
Wei Zhang
Lanqing Hong
Hengshuang Zhao
Hang Xu
MLLM
81
2
0
02 Apr 2025
AdPO: Enhancing the Adversarial Robustness of Large Vision-Language Models with Preference Optimization
Chaohu Liu
Tianyi Gui
Yu Liu
Linli Xu
VLM
AAML
68
1
0
02 Apr 2025
Enhanced Cross-modal 3D Retrieval via Tri-modal Reconstruction
Junlong Ren
Hao Wang
36
0
0
02 Apr 2025
Safeguarding Vision-Language Models: Mitigating Vulnerabilities to Gaussian Noise in Perturbation-based Attacks
Jiawei Wang
Yushen Zuo
Yuanjun Chai
Z. Liu
Yichen Fu
Yichun Feng
Kin-Man Lam
AAML
VLM
40
0
0
02 Apr 2025
Slow-Fast Architecture for Video Multi-Modal Large Language Models
Min Shi
Shihao Wang
Chieh-Yun Chen
Jitesh Jain
Kai Wang
Junjun Xiong
Guilin Liu
Zhiding Yu
Humphrey Shi
31
1
0
02 Apr 2025
TimeSearch: Hierarchical Video Search with Spotlight and Reflection for Human-like Long Video Understanding
Junwen Pan
Rui Zhang
Xin Wan
Yuan Zhang
Ming Lu
Qi She
VLM
36
1
0
02 Apr 2025
Towards Unified Referring Expression Segmentation Across Omni-Level Visual Target Granularities
Jing Liu
Wenxuan Wang
Yisi Zhang
Yepeng Tang
Xingjian He
Longteng Guo
Tongtian Yue
Xinlong Wang
ObjD
46
0
0
02 Apr 2025
WorldPrompter: Traversable Text-to-Scene Generation
Zhaoyang Zhang
Yannick Hold-Geoffroy
Miloš Hašan
Chen Ziwen
Fujun Luan
Julie Dorsey
Yiwei Hu
VGen
48
0
0
02 Apr 2025
Multimodal Reference Visual Grounding
Yangxiao Lu
Ruosen Li
Liqiang Jing
Jikai Wang
Xinya Du
Yunhui Guo
Nicholas Ruozzi
Yu Xiang
ObjD
76
0
0
02 Apr 2025
Aligned Better, Listen Better for Audio-Visual Large Language Models
Yuxin Guo
Shuailei Ma
Shijie Ma
Xiaoyi Bao
Chen-Wei Xie
Kecheng Zheng
Tingyu Weng
Siyang Sun
Yun Zheng
Wei Zou
MLLM
AuLLM
58
2
0
02 Apr 2025
Recitation over Reasoning: How Cutting-Edge Language Models Can Fail on Elementary School-Level Reasoning Problems?
Kai Yan
Yufei Xu
Zhengyin Du
Xuesong Yao
Z. Wang
Xiaowen Guo
Jiecao Chen
ReLM
ELM
LRM
92
3
0
01 Apr 2025
ShieldGemma 2: Robust and Tractable Image Content Moderation
Wenjun Zeng
D. Kurniawan
Ryan Mullins
Yuchi Liu
Tamoghna Saha
...
Mani Malek
Hamid Palangi
Joon Baek
Rick Pereira
Karthik Narasimhan
AI4MH
31
0
0
01 Apr 2025
Improved Visual-Spatial Reasoning via R1-Zero-Like Training
Zhenyi Liao
Qingsong Xie
Yanhao Zhang
Zijian Kong
Haonan Lu
Zhenyu Yang
Zhijie Deng
ReLM
VLM
LRM
101
0
1
01 Apr 2025
Scaling Language-Free Visual Representation Learning
David Fan
Shengbang Tong
Jiachen Zhu
Koustuv Sinha
Zhuang Liu
...
Michael G. Rabbat
Nicolas Ballas
Yann LeCun
Amir Bar
Saining Xie
CLIP
VLM
56
2
0
01 Apr 2025
POPEN: Preference-Based Optimization and Ensemble for LVLM-Based Reasoning Segmentation
Lanyun Zhu
Tianrun Chen
Qianxiong Xu
Xuanyi Liu
Deyi Ji
Haiyang Wu
De Wen Soh
J. Liu
VLM
LRM
44
0
0
01 Apr 2025
Scaling Prompt Instructed Zero Shot Composed Image Retrieval with Image-Only Data
Yiqun Duan
Sameera Ramasinghe
Stephen Gould
Ajanthan Thalaiyasingam
43
0
0
01 Apr 2025
4th PVUW MeViS 3rd Place Report: Sa2VA
Haobo Yuan
Tao Zhang
X. Li
Lu Qi
Zilong Huang
Shilin Xu
Jiashi Feng
Ming Yang
33
1
0
01 Apr 2025
AI Judges in Design: Statistical Perspectives on Achieving Human Expert Equivalence With Vision-Language Models
Kristen M. Edwards
Farnaz Tehranchi
Scarlett R. Miller
Faez Ahmed
61
0
0
01 Apr 2025
Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features
Jewon Lee
Ki-Ung Song
Seungmin Yang
Donguk Lim
Jaeyeon Kim
Wooksu Shin
Bo-Kyeong Kim
Yong Jae Lee
Tae-Ho Kim
VLM
55
0
0
01 Apr 2025
IDMR: Towards Instance-Driven Precise Visual Correspondence in Multimodal Retrieval
Bangwei Liu
Yicheng Bao
Shaohui Lin
Xuhong Wang
Xin Tan
Y. Wang
Yuan Xie
Chaochao Lu
66
0
0
01 Apr 2025
Communication-Efficient and Personalized Federated Foundation Model Fine-Tuning via Tri-Matrix Adaptation
Y. Li
Bo Liu
Sheng Huang
Z. Zhang
Xiaotong Yuan
Richang Hong
41
0
0
31 Mar 2025
ExScene: Free-View 3D Scene Reconstruction with Gaussian Splatting from a Single Image
Tianyi Gong
Boyan Li
Yifei Zhong
Fangxin Wang
3DGS
VGen
42
0
0
31 Mar 2025
Consistent Subject Generation via Contrastive Instantiated Concepts
Lee Hsin-Ying
Kelvin Chan
Ming Yang
DiffM
95
0
0
31 Mar 2025
AirCache: Activating Inter-modal Relevancy KV Cache Compression for Efficient Large Vision-Language Model Inference
Kai Huang
Hao Zou
Bochen Wang
Ye Xi
Zhen Xie
Hao Wang
VLM
42
0
0
31 Mar 2025
HOIGen-1M: A Large-scale Dataset for Human-Object Interaction Video Generation
Kun Liu
Qi Liu
Xinchen Liu
Jie Li
Yongdong Zhang
Jiebo Luo
Xiaodong He
Wu Liu
VGen
35
0
0
31 Mar 2025
XLRS-Bench: Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery?
Fengxiang Wang
H. Wang
Mingshuo Chen
Di Wang
Yulin Wang
...
L. Lan
Wenjing Yang
J. Zhang
Zhiyuan Liu
Maosong Sun
52
2
0
31 Mar 2025
Orchestrate Multimodal Data with Batch Post-Balancing to Accelerate Multimodal Large Language Model Training
Yijie Zheng
Bangjun Xiao
Lei Shi
Xiaoyang Li
Faming Wu
Tianyu Li
Xuefeng Xiao
Y. Zhang
Y. Wang
Shouda Liu
MLLM
MoE
67
1
0
31 Mar 2025
Towards Understanding How Knowledge Evolves in Large Vision-Language Models
Sudong Wang
Y. Zhang
Yao Zhu
Jianing Li
Zizhe Wang
Y. Liu
Xiangyang Ji
90
0
0
31 Mar 2025
Self-Evolving Visual Concept Library using Vision-Language Critics
Atharva Sehgal
Patrick Yuan
Ziniu Hu
Yisong Yue
Jennifer J. Sun
Swarat Chaudhuri
VLM
45
0
0
31 Mar 2025
Enhancing Image Resolution of Solar Magnetograms: A Latent Diffusion Model Approach
Francesco P. Ramunno
Paolo Massa
Vitaliy Kinakh
Brandon Panos
A. Csillaghy
S. Voloshynovskiy
DiffM
53
0
0
31 Mar 2025
FakeScope: Large Multimodal Expert Model for Transparent AI-Generated Image Forensics
Yixuan Li
Yu Tian
Yipo Huang
Wei Lu
Shiqi Wang
Weisi Lin
Anderson de Rezende Rocha
54
0
0
31 Mar 2025
KOFFVQA: An Objectively Evaluated Free-form VQA Benchmark for Large Vision-Language Models in the Korean Language
Yoonshik Kim
Jaeyoon Jung
37
0
0
31 Mar 2025
DeepDubber-V1: Towards High Quality and Dialogue, Narration, Monologue Adaptive Movie Dubbing Via Multi-Modal Chain-of-Thoughts Reasoning Guidance
Junjie Zheng
Zihao Chen
Chaofan Ding
Xinhan Di
VGen
67
1
0
31 Mar 2025
CA^2ST: Cross-Attention in Audio, Space, and Time for Holistic Video Recognition
Jongseo Lee
Joohyun Chang
Dongho Lee
Jinwoo Choi
51
0
0
30 Mar 2025
ReasonGrounder: LVLM-Guided Hierarchical Feature Splatting for Open-Vocabulary 3D Visual Grounding and Reasoning
Zhenyang Liu
Yikai Wang
Sixiao Zheng
Tongying Pan
Longfei Liang
Yanwei Fu
Xiangyang Xue
LRM
49
0
0
30 Mar 2025
ViT-Linearizer: Distilling Quadratic Knowledge into Linear-Time Vision Models
Guoyizhe Wei
Rama Chellappa
31
0
0
30 Mar 2025
EagleVision: Object-level Attribute Multimodal LLM for Remote Sensing
Hongxiang Jiang
Jihao Yin
Qixiong Wang
Jiaqi Feng
Guo Chen
48
0
0
30 Mar 2025
ViLAaD: Enhancing "Attracting and Dispersing'' Source-Free Domain Adaptation with Vision-and-Language Model
Shuhei Tarashima
Xinqi Shu
Norio Tagawa
VLM
46
0
0
30 Mar 2025
Physically Ground Commonsense Knowledge for Articulated Object Manipulation with Analytic Concepts
Jianhua Sun
Jiude Wei
Y. Li
Cewu Lu
LM&Ro
54
1
0
30 Mar 2025
Aurelia: Test-time Reasoning Distillation in Audio-Visual LLMs
Sanjoy Chowdhury
Hanan Gani
Nishit Anand
Sayan Nag
Ruohan Gao
Mohamed Elhoseiny
Salman Khan
Dinesh Manocha
LRM
40
0
0
29 Mar 2025
Empowering Large Language Models with 3D Situation Awareness
Zhihao Yuan
Yibo Peng
Jinke Ren
Yinghong Liao
Yatong Han
Chun-Mei Feng
Hengshuang Zhao
G. Li
Shuguang Cui
Zhen Li
51
0
0
29 Mar 2025
RefChartQA: Grounding Visual Answer on Chart Images through Instruction Tuning
Alexander Vogel
Omar Moured
Yufan Chen
Jiaming Zhang
Rainer Stiefelhagen
35
0
0
29 Mar 2025
From Flatland to Space: Teaching Vision-Language Models to Perceive and Reason in 3D
Jiahui Zhang
Yurui Chen
Yanpeng Zhou
Yueming Xu
Ze Huang
...
Xinyue Cai
G. Huang
Xingyue Quan
Hang Xu
Li Zhang
LRM
89
0
0
29 Mar 2025
Agent-Centric Personalized Multiple Clustering with Multi-Modal LLMs
Ziye Chen
Yiqun Duan
Riheng Zhu
Zhenbang Sun
Mingming Gong
35
0
0
28 Mar 2025
Unveiling the Mist over 3D Vision-Language Understanding: Object-centric Evaluation with Chain-of-Analysis
J. Huang
Baoxiong Jia
Y. Wang
Ziyu Zhu
Xiongkun Linghu
Qing Li
Song-Chun Zhu
Siyuan Huang
75
3
0
28 Mar 2025
Previous
1
2
3
...
5
6
7
...
42
43
44
Next