ResearchTrend.AI
Visual Instruction Tuning

17 April 2023
Haotian Liu
Chunyuan Li
Qingyang Wu
Yong Jae Lee
    SyDa
    VLM
    MLLM
arXiv: 2304.08485

Papers citing "Visual Instruction Tuning"

50 / 3,212 papers shown
MedAgent-Pro: Towards Multi-modal Evidence-based Medical Diagnosis via Reasoning Agentic Workflow
Ziyue Wang
Junde Wu
Chang Han Low
Yueming Jin
LRM
57
1
0
21 Mar 2025
Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models
Jianing Qi
Jiawei Liu
Hao Tang
Zhigang Zhu
104
1
0
21 Mar 2025
A Vision Centric Remote Sensing Benchmark
Abduljaleel Adejumo
Faegheh Yeganli
Clifford Broni-Bediako
Aoran Xiao
Naoto Yokoya
Mennatullah Siam
64
0
0
20 Mar 2025
GraspCoT: Integrating Physical Property Reasoning for 6-DoF Grasping under Flexible Language Instructions
Xiaomeng Chu
Jiajun Deng
Guoliang You
Wei Liu
X. Li
Jianmin Ji
Y. Zhang
77
0
0
20 Mar 2025
VerbDiff: Text-Only Diffusion Models with Enhanced Interaction Awareness
SeungJu Cha
Kwanyoung Lee
Ye-Chan Kim
Hyunwoo Oh
Dong-Jin Kim
48
0
0
20 Mar 2025
Enhancing Zero-Shot Image Recognition in Vision-Language Models through Human-like Concept Guidance
Hui Liu
Wenya Wang
Kecheng Chen
Jie Liu
Yibing Liu
Tiexin Qin
Peisong He
Xinghao Jiang
Haoliang Li
BDL
VLM
153
0
0
20 Mar 2025
UMIT: Unifying Medical Imaging Tasks via Vision-Language Models
Haiyang Yu
Siyang Yi
Ke Niu
Minghan Zhuo
Bin Li
LM&MA
45
0
0
20 Mar 2025
BadToken: Token-level Backdoor Attacks to Multi-modal Large Language Models
Zenghui Yuan
Jiawen Shi
Pan Zhou
Neil Zhenqiang Gong
Lichao Sun
AAML
66
1
0
20 Mar 2025
EDEN: Enhanced Diffusion for High-quality Large-motion Video Frame Interpolation
Zihao Zhang
Haoran Chen
Haoyu Zhao
Guansong Lu
Yanwei Fu
Hang Xu
Zuxuan Wu
VGen
DiffM
62
0
0
20 Mar 2025
IRef-VLA: A Benchmark for Interactive Referential Grounding with Imperfect Language in 3D Scenes
Haochen Zhang
Nader Zantout
Pujith Kachana
Ji Zhang
Wenshan Wang
VGen
51
0
0
20 Mar 2025
MASH-VLM: Mitigating Action-Scene Hallucination in Video-LLMs through Disentangled Spatial-Temporal Representations
Kyungho Bae
Jinhyung Kim
Sihaeng Lee
Soonyoung Lee
G. Lee
Jinwoo Choi
62
1
0
20 Mar 2025
TruthLens: Explainable DeepFake Detection for Face Manipulated and Fully Synthetic Data
Rohit Kundu
Athula Balachandran
A. Roy-Chowdhury
45
0
0
20 Mar 2025
Video-VoT-R1: An efficient video inference model integrating image packing and AoE architecture
Cheng Li
Jiexiong Liu
Yixuan Chen
Yanqin Jia
MLLM
VLM
76
0
0
20 Mar 2025
Vision-Speech Models: Teaching Speech Models to Converse about Images
Amélie Royer
Moritz Böhle
Gabriel de Marmiesse
Laurent Mazaré
Neil Zeghidour
Alexandre Défossez
P. Pérez
AuLLM
VLM
84
0
0
19 Mar 2025
Visual Position Prompt for MLLM based Visual Grounding
Wei Tang
Yanpeng Sun
Qinying Gu
Zechao Li
VLM
47
0
0
19 Mar 2025
Mitigating Object Hallucinations in MLLMs via Multi-Frequency Perturbations
Shuo Li
Jiajun Sun
Guodong Zheng
Xiaoran Fan
Yujiong Shen
...
Wenming Tan
Tao Ji
Tao Gui
Qi Zhang
Xuanjing Huang
AAML
VLM
85
0
0
19 Mar 2025
1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities
Kevin Wang
Ishaan Javali
Michał Bortkiewicz
Tomasz Trzciński
Benjamin Eysenbach
SSL
OffRL
67
0
0
19 Mar 2025
Aligning Information Capacity Between Vision and Language via Dense-to-Sparse Feature Distillation for Image-Text Matching
Yang Liu
Wentao Feng
Zhuoyao Liu
Shudong Huang
Jiancheng Lv
DiffM
VLM
53
0
0
19 Mar 2025
Unlocking the Capabilities of Vision-Language Models for Generalizable and Explainable Deepfake Detection
Peipeng Yu
Jianwei Fei
Hui Gao
Xuan Feng
Zhihua Xia
Chip-Hong Chang
MLLM
VLM
81
1
0
19 Mar 2025
Universal Scene Graph Generation
Shengqiong Wu
Hao Fei
Tat-Seng Chua
38
0
0
19 Mar 2025
Visual Persona: Foundation Model for Full-Body Human Customization
Jisu Nam
Soowon Son
Zhan Xu
Jing Shi
Difan Liu
Feng Liu
Aashish Misraa
Seungryong Kim
Yang Zhou
DiffM
44
0
0
19 Mar 2025
TULIP: Towards Unified Language-Image Pretraining
Zineng Tang
Long Lian
Seun Eisape
Xudong Wang
Roei Herzig
Adam Yala
Alane Suhr
Trevor Darrell
David M. Chan
VLM
CLIP
MLLM
103
3
0
19 Mar 2025
Optimal Transport Adapter Tuning for Bridging Modality Gaps in Few-Shot Remote Sensing Scene Classification
Zhong Ji
Ci Liu
Jingren Liu
Chen Tang
Yanwei Pang
X. Li
OT
51
0
0
19 Mar 2025
Intelligent Spatial Perception by Building Hierarchical 3D Scene Graphs for Indoor Scenarios with the Help of LLMs
Yao Cheng
Zhe Han
Fengyang Jiang
Huaizhen Wang
Fengyu Zhou
Qingshan Yin
Lei Wei
3DV
44
1
0
19 Mar 2025
POSTA: A Go-to Framework for Customized Artistic Poster Generation
Haoyu Chen
Xiaojie Xu
Wenbo Li
Jingjing Ren
Tian Ye
Songhua Liu
Ying Chen
Lei Zhu
Xinchao Wang
DiffM
57
1
0
19 Mar 2025
VisNumBench: Evaluating Number Sense of Multimodal Large Language Models
Tengjin Weng
Jingyi Wang
Wenhao Jiang
Zhong Ming
VLM
LRM
49
0
0
19 Mar 2025
A Review on Large Language Models for Visual Analytics
Navya Sonal Agarwal
Sanjay Kumar Sonbhadra
43
0
0
19 Mar 2025
EgoDTM: Towards 3D-Aware Egocentric Video-Language Pretraining
Boshen Xu
Yuting Mei
Xinbi Liu
Sipeng Zheng
Qin Jin
VLM
MDE
65
0
0
19 Mar 2025
MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical Problems
Felix Chen
Hangjie Yuan
Yunqiu Xu
Tao Feng
Jun Cen
Pengwei Liu
Zeying Huang
Yi Yang
LRM
44
1
0
19 Mar 2025
EfficientLLaVA: Generalizable Auto-Pruning for Large Vision-language Models
Yinan Liang
Z. Wang
Xiuwei Xu
Jie Zhou
Jiwen Lu
VLM
LRM
48
0
0
19 Mar 2025
Cube: A Roblox View of 3D Intelligence
Foundation AI Team Roblox
Kiran Bhat
Nishchaie Khanna
Karun Channa
Tinghui Zhou
...
Kyle Price
Steve Han
Yiqing Wang
A. Singh
David Baszucki
61
0
0
19 Mar 2025
UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation
Qihui Zhang
Munan Ning
Zheyuan Liu
Yanbo Wang
Jiayi Ye
Yue Huang
Shuo Yang
Xiao Chen
Y. Song
Li Yuan
LRM
56
0
0
19 Mar 2025
ChatBEV: A Visual Language Model that Understands BEV Maps
Qingyao Xu
S. Chen
Guang Chen
Yanfeng Wang
Y. Zhang
46
0
0
18 Mar 2025
Impossible Videos
Zechen Bai
Hai Ci
Mike Zheng Shou
EGVM
VGen
72
0
0
18 Mar 2025
DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies
Wei Song
Y. Wang
Zijia Song
Yadong Li
Haoze Sun
Weipeng Chen
Zenan Zhou
Jianhua Xu
Jiaqi Wang
Kaicheng Yu
60
2
0
18 Mar 2025
MMR: A Large-scale Benchmark Dataset for Multi-target and Multi-granularity Reasoning Segmentation
Donggon Jang
Yucheol Cho
Suin Lee
Taehyeon Kim
Dae-Shik Kim
VLM
65
1
0
18 Mar 2025
VisEscape: A Benchmark for Evaluating Exploration-driven Decision-making in Virtual Escape Rooms
Seungwon Lim
Sungwoong Kim
Jihwan Yu
Sungjae Lee
Jiwan Chung
Youngjae Yu
69
1
0
18 Mar 2025
Improving LLM Video Understanding with 16 Frames Per Second
Y. Li
Changli Tang
Jimin Zhuang
Yudong Yang
Guangzhi Sun
W. Li
Z. Ma
Chao Zhang
VLM
75
1
0
18 Mar 2025
Growing a Twig to Accelerate Large Vision-Language Models
Zhenwei Shao
Mingyang Wang
Zhou Yu
Wenwen Pan
Yan Yang
Tao Wei
H. Zhang
Ning Mao
Wei Chen
Jun Yu
VLM
59
1
0
18 Mar 2025
Can Large Vision Language Models Read Maps Like a Human?
Shuo Xing
Zezhou Sun
Shuangyu Xie
Kaiyuan Chen
Yanjia Huang
Yuping Wang
Jiachen Li
Dezhen Song
Zhengzhong Tu
65
2
0
18 Mar 2025
Tracking Meets Large Multimodal Models for Driving Scenario Understanding
Ayesha Ishaq
Jean Lahoud
F. Khan
Salman Khan
Hisham Cholakkal
Rao Muhammad Anwer
54
0
0
18 Mar 2025
Identifying and Mitigating Position Bias of Multi-image Vision-Language Models
Xinyu Tian
Shu Zou
Zhaoyuan Yang
Jing Zhang
60
0
0
18 Mar 2025
LED: LLM Enhanced Open-Vocabulary Object Detection without Human Curated Data Generation
Yang Zhou
Shiyu Zhao
Y. Chen
Z. Wang
Dimitris N. Metaxas
ObjD
56
0
0
18 Mar 2025
Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM
Xinyu Fang
Z. Chen
Kai Lan
Lixin Ma
Shengyuan Ding
...
Zicheng Zhang
Guofeng Zhang
Haodong Duan
K. Chen
D. Lin
MLLM
58
1
0
18 Mar 2025
Marten: Visual Question Answering with Mask Generation for Multi-modal Document Understanding
Zining Wang
Tongkun Guan
Pei Fu
Chen Duan
Qianyi Jiang
Zhentao Guo
Shan Guo
Junfeng Luo
Wei-Ming Shen
Xiaokang Yang
MLLM
VLM
69
0
0
18 Mar 2025
EIAD: Explainable Industrial Anomaly Detection Via Multi-Modal Large Language Models
Zongyun Zhang
Jiacheng Ruan
Xian Gao
Ting Liu
Yuzhuo Fu
70
2
0
18 Mar 2025
Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning
Nvidia
A. Azzolini
Hannah Brandon
Prithvijit Chattopadhyay
Huayu Chen
...
Yao Xu
X. Yang
Zhuolin Yang
Xiaohui Zeng
Z. Zhang
LM&Ro
LRM
AI4CE
52
5
0
18 Mar 2025
The Power of Context: How Multimodality Improves Image Super-Resolution
Kangfu Mei
Hossein Talebi
Mojtaba Ardakani
Vishal M. Patel
P. Milanfar
M. Delbracio
DiffM
77
1
0
18 Mar 2025
Med-R1: Reinforcement Learning for Generalizable Medical Reasoning in Vision-Language Models
Yuxiang Lai
Jike Zhong
Ming Li
Shitian Zhao
Xiaofeng Yang
OffRL
LRM
LM&MA
VLM
73
7
0
18 Mar 2025
MP-GUI: Modality Perception with MLLMs for GUI Understanding
Ziwei Wang
Weizhi Chen
Leyang Yang
Sheng Zhou
Shengchu Zhao
Hanbei Zhan
Jiongchao Jin
Liangcheng Li
Zirui Shao
Jiajun Bu
60
1
0
18 Mar 2025