2304.08485
Cited By
Visual Instruction Tuning
17 April 2023
Haotian Liu
Chunyuan Li
Qingyang Wu
Yong Jae Lee
SyDa
VLM
MLLM
Papers citing "Visual Instruction Tuning"
50 / 2,152 papers shown
Title
Air-Ground Collaboration for Language-Specified Missions in Unknown Environments
Fernando Cladera
Zachary Ravichandran
Jason Hughes
Varun Murali
Carlos Nieto-Granda
M. Hsieh
George J. Pappas
Camillo J. Taylor
Vijay R. Kumar
11
0
0
14 May 2025
Zero-shot Quantization: A Comprehensive Survey
Minjun Kim
Jaehyeon Choi
Jongkeun Lee
Wonjin Cho
U. Kang
MQ
7
0
0
14 May 2025
Real2Render2Real: Scaling Robot Data Without Dynamics Simulation or Robot Hardware
Justin Yu
Letian Fu
Huang Huang
Karim El-Refai
Rares Ambrus
Richard Cheng
Muhammad Zubair Irshad
Ken Goldberg
7
0
0
14 May 2025
From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation
Yifu Yuan
Haiqin Cui
Yibin Chen
Zibin Dong
Fei Ni
Longxin Kou
Jinyi Liu
Pengyi Li
Yan Zheng
Jianye Hao
19
0
0
13 May 2025
OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning
Zhaochen Su
Linjie Li
Mingyang Song
Yunzhuo Hao
Zhengyuan Yang
...
Guanjie Chen
Jiawei Gu
Juntao Li
Xiaoye Qu
Yu Cheng
OffRL
LRM
16
0
0
13 May 2025
DSADF: Thinking Fast and Slow for Decision Making
Alex Zhihao Dou
Dongfei Cui
Jun Yan
W. Wang
Benteng Chen
Haoming Wang
Zeke Xie
Shufei Zhang
OffRL
17
0
0
13 May 2025
ORACLE-Grasp: Zero-Shot Task-Oriented Robotic Grasping using Large Multimodal Models
Avihai Giuili
Rotem Atari
A. Sintov
VLM
17
0
0
13 May 2025
CLTP: Contrastive Language-Tactile Pre-training for 3D Contact Geometry Understanding
Wenxuan Ma
Xiaoge Cao
Y. Zhang
Chaofan Zhang
Shaobo Yang
Peng Hao
Bin Fang
Yinghao Cai
Shaowei Cui
Shuo Wang
17
0
0
13 May 2025
Ultra Lowrate Image Compression with Semantic Residual Coding and Compression-aware Diffusion
Anle Ke
Xu Zhang
Tong Chen
Ming-Tse Lu
Chao Zhou
Jiawen Gu
Zhan Ma
DiffM
20
0
0
13 May 2025
Prioritizing Image-Related Tokens Enhances Vision-Language Pre-Training
Y. Chen
Hao Peng
Tong Zhang
Heng Ji
VLM
9
0
0
13 May 2025
STORYANCHORS: Generating Consistent Multi-Scene Story Frames for Long-Form Narratives
Bo Wang
Haoyang Huang
Zhiyin Lu
F. Liu
Guoqing Ma
Jianlong Yuan
Y. Zhang
Nan Duan
VGen
14
0
0
13 May 2025
Visually Guided Decoding: Gradient-Free Hard Prompt Inversion with Language Models
Donghoon Kim
Minji Bae
Kyuhong Shim
B. Shim
26
0
0
13 May 2025
An integrated language-vision foundation model for conversational diagnostics and triaging in primary eye care
Z. Soh
Yang Bai
Kai Yu
Yang Zhou
Xiaofeng Lei
...
J. Jonas
T. Y. Wong
Rick Siow Mong Goh
Yong Liu
Ching-Yu Cheng
11
0
0
13 May 2025
Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving
Zongchuang Zhao
Haoyu Fu
Dingkang Liang
Xin Zhou
Dingyuan Zhang
Hongwei Xie
Bing Wang
Xiang Bai
MLLM
VLM
39
0
0
13 May 2025
DexWild: Dexterous Human Interactions for In-the-Wild Robot Policies
Tony Tao
M. K. Srirama
Jason Jingzhou Liu
Kenneth Shaw
Deepak Pathak
21
0
0
12 May 2025
QuantX: A Framework for Hardware-Aware Quantization of Generative AI Workloads
Khurram Mazher
Saad Bin Nasir
MQ
37
0
0
12 May 2025
Critique Before Thinking: Mitigating Hallucination through Rationale-Augmented Instruction Tuning
Zexian Yang
Dian Li
Dayan Wu
Gang Liu
Weiping Wang
MLLM
LRM
36
0
0
12 May 2025
Visually Interpretable Subtask Reasoning for Visual Question Answering
Yu Cheng
A. Goel
Hakan Bilen
LRM
24
0
0
12 May 2025
Visual Instruction Tuning with Chain of Region-of-Interest
Yixin Chen
Shuai Zhang
Boran Han
Bernie Wang
21
0
0
11 May 2025
Multi-Modal Explainable Medical AI Assistant for Trustworthy Human-AI Collaboration
Honglong Yang
Shanshan Song
Yi Qin
Lehan Wang
Haonan Wang
Xinpeng Ding
Qixiang Zhang
Bodong Du
X. Li
LM&MA
24
0
0
11 May 2025
Hallucination-Aware Multimodal Benchmark for Gastrointestinal Image Analysis with Large Vision-Language Models
Bidur Khanal
Sandesh Pokhrel
Sanjay Bhandari
Ramesh Rana
Nikesh Shrestha
Ram Bahadur Gurung
Cristian A. Linte
Angus Watson
Y. Shrestha
Binod Bhattarai
VLM
26
0
0
11 May 2025
DriveSOTIF: Advancing Perception SOTIF Through Multimodal Large Language Models
Shucheng Huang
Freda Shi
Chen Sun
Jiaming Zhong
Minghao Ning
Yufeng Yang
Yukun Lu
Hong Wang
A. Khajepour
19
0
0
11 May 2025
Exploring Multimodal Foundation AI and Expert-in-the-Loop for Sustainable Management of Wild Salmon Fisheries in Indigenous Rivers
Chi Xu
Yili Jin
Sami Ma
Rongsheng Qian
Hao Fang
...
Xue Liu
Edith Ngai
William I. Atlas
Katrina M. Connors
Mark A. Spoljaric
21
0
0
10 May 2025
The ML.ENERGY Benchmark: Toward Automated Inference Energy Measurement and Optimization
Jae-Won Chung
Jiachen Liu
Jeff J. Ma
Ruofan Wu
Oh Jun Kweon
Yuxuan Xia
Zhiyu Wu
Mosharaf Chowdhury
21
0
0
09 May 2025
FG-CLIP: Fine-Grained Visual and Textual Alignment
Chunyu Xie
Bin Wang
Fanjing Kong
Jincheng Li
Dawei Liang
Gengshen Zhang
Dawei Leng
Yuhui Yin
CLIP
VLM
42
0
0
08 May 2025
TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation
Haokun Lin
Teng Wang
Yixiao Ge
Yuying Ge
Zhichao Lu
Ying Wei
Qingfu Zhang
Zhenan Sun
Ying Shan
MLLM
VLM
64
0
0
08 May 2025
PlaceIt3D: Language-Guided Object Placement in Real 3D Scenes
Ahmed Abdelreheem
Filippo Aleotti
Jamie Watson
Z. Qureshi
Abdelrahman Eldesokey
Peter Wonka
Gabriel J. Brostow
Sara Vicente
Guillermo Garcia-Hernando
DiffM
50
0
0
08 May 2025
FLAM: Frame-Wise Language-Audio Modeling
Yusong Wu
Christos Tsirigotis
Ke Chen
Cheng-Zhi Anna Huang
Aaron C. Courville
Oriol Nieto
Prem Seetharaman
Justin Salamon
43
0
0
08 May 2025
Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding
Han Xiao
Yina Xie
Guanxin Tan
Yinghao Chen
R. Hu
...
Peng Gao
Yafei Wen
Xiaoxin Chen
Shuai Ren
Hongsheng Li
VLM
40
0
0
08 May 2025
X-Transfer Attacks: Towards Super Transferable Adversarial Attacks on CLIP
Hanxun Huang
Sarah Monazam Erfani
Yige Li
Xingjun Ma
James Bailey
AAML
34
0
0
08 May 2025
StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant
Haibo Wang
Bo Feng
Zhengfeng Lai
Mingze Xu
Shiyu Li
Weifeng Ge
Afshin Dehghan
Meng Cao
Ping-Chia Huang
OffRL
49
3
0
08 May 2025
Looking Beyond Language Priors: Enhancing Visual Comprehension and Attention in Multimodal Models
Aarti Ghatkesar
Uddeshya Upadhyay
Ganesh Venkatesh
VLM
31
0
0
08 May 2025
Collaborative Multi-LoRA Experts with Achievement-based Multi-Tasks Loss for Unified Multimodal Information Extraction
Li Yuan
Yi Cai
Xudong Shen
Qing Li
Qingbao Huang
Zikun Deng
Tao Wang
MoMe
OffRL
MoE
36
0
0
08 May 2025
MonoCoP: Chain-of-Prediction for Monocular 3D Object Detection
Zhihao Zhang
Abhinav Kumar
Girish Chandar Ganesan
Xiaoming Liu
92
0
0
07 May 2025
R^3-VQA: "Read the Room" by Video Social Reasoning
Lixing Niu
Jiapeng Li
Xingping Yu
Shu Wang
Ruining Feng
Bo Wu
Ping Wei
Y. Wang
Lifeng Fan
43
0
0
07 May 2025
OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning
Xianhang Li
Y. Liu
Haoqin Tu
Hongru Zhu
Cihang Xie
VLM
76
0
0
07 May 2025
HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation
Teng Hu
Zhentao Yu
Zhengguang Zhou
Sen Liang
Yuan Zhou
Qin Lin
Qinglin Lu
DiffM
VGen
52
0
0
07 May 2025
VideoPath-LLaVA: Pathology Diagnostic Reasoning Through Video Instruction Tuning
T. Vuong
J. T. Kwak
VGen
32
0
0
07 May 2025
Reinforced Correlation Between Vision and Language for Precise Medical AI Assistant
Haonan Wang
Jiaji Mao
Lehan Wang
Qixiang Zhang
Marawan Elbatel
...
Weifeng Qin
H. Li
Jialin Liang
Jun Shen
Xiaomeng Li
MedIm
26
0
0
06 May 2025
Reducing Annotation Burden in Physical Activity Research Using Vision-Language Models
Abram Schonfeldt
Benjamin Maylor
Xiaofang Chen
Ronald Clark
Aiden Doherty
62
0
0
06 May 2025
OpenHelix: A Short Survey, Empirical Analysis, and Open-Source Dual-System VLA Model for Robotic Manipulation
Can Cui
Pengxiang Ding
Wenxuan Song
Shuanghao Bai
Xinyang Tong
...
Yang Liu
Bofang Jia
H. Zhao
Siteng Huang
Donglin Wang
14
1
0
06 May 2025
SynSHRP2: A Synthetic Multimodal Benchmark for Driving Safety-critical Events Derived from Real-world Driving Data
Liang Shi
Boyu Jiang
Zhenyuan Yuan
Miguel A. Perez
Feng Guo
24
0
0
06 May 2025
Mitigating Image Captioning Hallucinations in Vision-Language Models
Fei Zhao
C. Zhang
Runlin Zhang
Tianyang Wang
Xi Li
VLM
37
0
0
06 May 2025
A Vision-Language Model for Focal Liver Lesion Classification
Song Jian
Hu Yuchang
Wang Hui
Chen Yen-Wei
VLM
MedIm
36
0
0
06 May 2025
VISLIX: An XAI Framework for Validating Vision Models with Slice Discovery and Analysis
Xinyuan Yan
Xiwei Xuan
Jorge Henrique Piazentin Ono
Jiajing Guo
V. Mohanty
Shekar Arvind Kumar
Liang Gou
Bei Wang
Liu Ren
30
0
0
06 May 2025
VLM Q-Learning: Aligning Vision-Language Models for Interactive Decision-Making
Jake Grigsby
Yuke Zhu
Michael S Ryoo
Juan Carlos Niebles
OffRL
VLM
34
0
0
06 May 2025
Token Communication-Driven Multimodal Large Models in Resource-Constrained Multiuser Networks
Junhe Zhang
Wanli Ni
Pengwei Wang
Dongyu Wang
14
0
0
06 May 2025
Task-Oriented Semantic Communication in Large Multimodal Models-based Vehicle Networks
Baoxia Du
H. Du
Dusit Niyato
Ruidong Li
51
0
0
05 May 2025
Structure Causal Models and LLMs Integration in Medical Visual Question Answering
Zibo Xu
Qiang Li
Weizhi Nie
Weijie Wang
Anan Liu
CML
MedIm
40
0
0
05 May 2025
Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities
X. Zhang
Jintao Guo
Shanshan Zhao
Minghao Fu
Lunhao Duan
Guo-Hua Wang
Qing-Guo Chen
Zhao Xu
Weihua Luo
Kaifu Zhang
DiffM
60
0
0
05 May 2025