Visual Instruction Tuning

17 April 2023
Haotian Liu
Chunyuan Li
Qingyang Wu
Yong Jae Lee
    SyDa
    VLM
    MLLM

Papers citing "Visual Instruction Tuning"

50 / 2,159 papers shown
Marmot: Multi-Agent Reasoning for Multi-Object Self-Correcting in Improving Image-Text Alignment
Jiayang Sun
H. Wang
Jie Cao
Huaibo Huang
R. He
DiffM
71
0
0
10 Apr 2025
MM-IFEngine: Towards Multimodal Instruction Following
Shengyuan Ding
Shenxi Wu
Xiangyu Zhao
Yuhang Zang
Haodong Duan
Xiaoyi Dong
Pan Zhang
Y. Cao
D. Lin
Jiaqi Wang
OffRL
56
1
0
10 Apr 2025
ZIP: An Efficient Zeroth-order Prompt Tuning for Black-box Vision-Language Models
Seonghwan Park
Jaehyeon Jeong
Yongjun Kim
Jaeho Lee
Namhoon Lee
VLM
44
0
0
09 Apr 2025
Patch Matters: Training-free Fine-grained Image Caption Enhancement via Local Perception
Ruotian Peng
Haiying He
Yake Wei
Yandong Wen
D. Hu
VLM
39
0
0
09 Apr 2025
LVC: A Lightweight Compression Framework for Enhancing VLMs in Long Video Understanding
Ziyi Wang
Haoran Wu
Yiming Rong
Deyang Jiang
Yixin Zhang
Y. Zhao
Shuang Xu
Bo Xu
VLM
41
0
0
09 Apr 2025
OmniCaptioner: One Captioner to Rule Them All
Yiting Lu
Jiakang Yuan
Zhen Li
Shitian Zhao
Qi Qin
...
Lei Bai
Zhibo Chen
Peng Gao
Bo Zhang
Peng Gao
MLLM
79
0
0
09 Apr 2025
Face-LLaVA: Facial Expression and Attribute Understanding through Instruction Tuning
Ashutosh Chaubey
Xulang Guan
Mohammad Soleymani
CVBM
MLLM
VLM
66
0
0
09 Apr 2025
Are We Done with Object-Centric Learning?
Alexander Rubinstein
Ameya Prabhu
Matthias Bethge
Seong Joon Oh
OCL
577
0
0
09 Apr 2025
Perception in Reflection
Yana Wei
Liang Zhao
Kangheng Lin
En Yu
Yuang Peng
...
Jianjian Sun
Haoran Wei
Zheng Ge
Xiangyu Zhang
Vishal M. Patel
31
0
0
09 Apr 2025
Decoupling Contrastive Decoding: Robust Hallucination Mitigation in Multimodal Large Language Models
Wei Chen
Xin Yan
Bin Wen
Fan Yang
Tingting Gao
Di Zhang
Long Chen
MLLM
92
0
0
09 Apr 2025
V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric Capabilities in Multimodal Large Language Models
Xiangxi Zheng
Linjie Li
Z. Yang
Ping Yu
Alex Jinpeng Wang
Rui Yan
Yuan Yao
Lijuan Wang
LRM
21
0
0
08 Apr 2025
Transfer between Modalities with MetaQueries
Xichen Pan
Satya Narayan Shukla
Aashu Singh
Zhuokai Zhao
Shlok Kumar Mishra
...
Jiuhai Chen
Kunpeng Li
F. Xu
Ji Hou
Saining Xie
DiffM
41
6
0
08 Apr 2025
Earth-Adapter: Bridge the Geospatial Domain Gaps with Mixture of Frequency Adaptation
Xiaoxing Hu
Ziyang Gong
Y. Wang
Yuru Jia
Gen Luo
Xue Yang
85
0
0
08 Apr 2025
PaMi-VDPO: Mitigating Video Hallucinations by Prompt-Aware Multi-Instance Video Preference Learning
Xinpeng Ding
K. Zhang
Jinahua Han
Lanqing Hong
Hang Xu
X. Li
MLLM
VLM
124
0
0
08 Apr 2025
Measuring Déjà vu Memorization Efficiently
Narine Kokhlikyan
Bargav Jayaraman
Florian Bordes
Chuan Guo
Kamalika Chaudhuri
23
1
0
08 Apr 2025
OmniSVG: A Unified Scalable Vector Graphics Generation Model
Yiying Yang
Wei Cheng
Sijin Chen
Xianfang Zeng
Jiaxu Zhang
Liao Wang
Gang Yu
Xingjun Ma
Yu Jiang
VLM
40
0
0
08 Apr 2025
SVLTA: Benchmarking Vision-Language Temporal Alignment via Synthetic Video Situation
Hao Du
Bo Wu
Yan Lu
Zhendong Mao
22
0
0
08 Apr 2025
MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in Multimodal Large Language Models
Pengfei Zhou
Fanrui Zhang
Xiaopeng Peng
Zhaopan Xu
Jiaxin Ai
...
Kai Wang
Xiaojun Chang
Wenqi Shao
Yang You
K. Zhang
ELM
LRM
30
0
0
08 Apr 2025
On the Suitability of Reinforcement Fine-Tuning to Visual Tasks
X. Chen
Wei Li
Chunxu Liu
Chi Xie
Xiaoyan Hu
Chengqian Ma
Feng Zhu
Rui Zhao
ReLM
LRM
54
0
0
08 Apr 2025
SCAM: A Real-World Typographic Robustness Evaluation for Multimodal Foundation Models
Justus Westerhoff
Erblina Purellku
Jakob Hackstein
Jonas Loos
Leo Pinetzki
Lorenz Hufe
AAML
28
0
0
07 Apr 2025
SmolVLM: Redefining small and efficient multimodal models
Andres Marafioti
Orr Zohar
Miquel Farré
Merve Noyan
Elie Bakouch
...
Hugo Larcher
Mathieu Morlon
Lewis Tunstall
Leandro von Werra
Thomas Wolf
VLM
34
4
0
07 Apr 2025
Enhancing Compositional Reasoning in Vision-Language Models with Synthetic Preference Data
Samarth Mishra
Kate Saenko
Venkatesh Saligrama
CoGe
LRM
37
0
0
07 Apr 2025
URECA: Unique Region Caption Anything
Sangbeom Lim
J. Kim
Heeji Yoon
Jaewoo Jung
Seungryong Kim
29
0
0
07 Apr 2025
Taxonomy-Aware Evaluation of Vision-Language Models
Vésteinn Snæbjarnarson
Kevin Du
Niklas Stoehr
Serge J. Belongie
Ryan Cotterell
Nico Lang
Stella Frank
27
0
0
07 Apr 2025
OrderChain: A General Prompting Paradigm to Improve Ordinal Understanding Ability of MLLM
Jinhong Wang
Shuo Tong
Jian Liu
Dongqi Tang
Weiqiang Wang
Wentong Li
Hongxia Xu
D. Z. Chen
J. Chen
Jian Wu
LRM
21
0
0
07 Apr 2025
The 1st Solution for 4th PVUW MeViS Challenge: Unleashing the Potential of Large Multimodal Models for Referring Video Segmentation
Hao Fang
Runmin Cong
Xiankai Lu
Z. Chen
Wei Zhang
29
0
0
07 Apr 2025
Ternarization of Vision Language Models for use on edge devices
Ben Crulis
Cyril de Runz
Barthélémy Serres
Gilles Venturini
VLM
55
0
0
07 Apr 2025
InteractVLM: 3D Interaction Reasoning from 2D Foundational Models
Sai Kumar Dwivedi
Dimitrije Antić
Shashank Tripathi
Omid Taheri
Cordelia Schmid
M. Black
Dimitrios Tzionas
26
1
0
07 Apr 2025
Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting
Yunlong Tang
Jing Bi
Chao Huang
Susan Liang
Daiki Shimada
...
Jinxi He
Liu He
Zeliang Zhang
Jiebo Luo
Chenliang Xu
34
0
0
07 Apr 2025
Grounding 3D Object Affordance with Language Instructions, Visual Observations and Interactions
He Zhu
Quyu Kong
Kechun Xu
Xunlong Xia
Bing Deng
Jieping Ye
R. Xiong
Y. Wang
30
0
0
07 Apr 2025
REEF: Relevance-Aware and Efficient LLM Adapter for Video Understanding
Sakib Reza
Xiyun Song
Heather Yu
Zongfang Lin
Mohsen Moghaddam
Octavia Camps
23
0
0
07 Apr 2025
M2IV: Towards Efficient and Fine-grained Multimodal In-Context Learning in Large Vision-Language Models
Yanshu Li
Hongyang He
Yi Cao
Qisen Cheng
Xiang Fu
Ruixiang Tang
VLM
40
0
0
06 Apr 2025
The Point, the Vision and the Text: Does Point Cloud Boost Spatial Reasoning of Large Language Models?
Weichen Zhang
Ruiying Peng
Chen Gao
Jianjie Fang
Xin Zeng
...
Z. Wang
Jinqiang Cui
Xin Wang
Xinlei Chen
Y. Li
LRM
71
0
0
06 Apr 2025
Domain Generalization for Face Anti-spoofing via Content-aware Composite Prompt Engineering
J. Guo
Ajian Liu
Yunfeng Diao
J. Zhang
Hui Ma
Bo Zhao
Richang Hong
Meng Wang
21
0
0
06 Apr 2025
UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding
Yang Jiao
Haibo Qiu
Zequn Jie
S. Chen
Jingjing Chen
Lin Ma
Yu Jiang
26
2
0
06 Apr 2025
MedM-VL: What Makes a Good Medical LVLM?
Yiming Shi
Shaoshuai Yang
Xun Zhu
Haoyu Wang
Miao Li
Ji Wu
VLM
40
1
0
06 Apr 2025
JarvisIR: Elevating Autonomous Driving Perception with Intelligent Image Restoration
Yunlong Lin
Zixu Lin
Haoyu Chen
Panwang Pan
C. Li
Sixiang Chen
Yeying Jin
W. J. Li
Xinghao Ding
25
1
0
05 Apr 2025
Window Token Concatenation for Efficient Visual Large Language Models
Yifan Li
Wentao Bao
Botao Ye
Zhen Tan
Tianlong Chen
Huan Liu
Yu Kong
VLM
39
0
0
05 Apr 2025
TARAC: Mitigating Hallucination in LVLMs via Temporal Attention Real-time Accumulative Connection
C. Xie
Tongxuan Liu
Lei Jiang
Yuting Zeng
J. Guo
Yunheng Shen
Weizhe Huang
Jing Li
Xiaohua Xu
VLM
56
0
0
05 Apr 2025
NuScenes-SpatialQA: A Spatial Understanding and Reasoning Benchmark for Vision-Language Models in Autonomous Driving
Kexin Tian
Jingrui Mao
Y. Zhang
Jiwan Jiang
Yang Zhou
Zhengzhong Tu
CoGe
60
0
0
04 Apr 2025
Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision
Xiaofeng Han
Shunpeng Chen
Zenghuang Fu
Zhe Feng
Lue Fan
...
Li Guo
Weiliang Meng
Xiaopeng Zhang
Rongtao Xu
Shibiao Xu
60
0
0
03 Apr 2025
Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation
Chuanqi Cheng
Jian-Yu Guan
Wei Yu Wu
Rui Yan
VLM
45
0
0
03 Apr 2025
A Survey of Large Language Models in Mental Health Disorder Detection on Social Media
Zhuohan Ge
Nicole Hu
Darian Li
Yubo Wang
Shihao Qi
Yuming Xu
Han Shi
J. Zhang
AI4MH
56
0
0
03 Apr 2025
VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning
Xianwei Zhuang
Yuxin Xie
Yufan Deng
Dongchao Yang
Liming Liang
Jinghan Ru
Yuguo Yin
Yuexian Zou
68
1
0
03 Apr 2025
Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing
Xiangyu Zhao
Peiyuan Zhang
Kexian Tang
Hao Li
Zicheng Zhang
Guangtao Zhai
Junchi Yan
Hua Yang
Xue Yang
Haodong Duan
VLM
LRM
41
0
0
03 Apr 2025
SocialGesture: Delving into Multi-person Gesture Understanding
Xu Cao
Pranav Virupaksha
Wenqi Jia
Bolin Lai
Fiona Ryan
Sangmin Lee
James M. Rehg
SLR
49
0
0
03 Apr 2025
Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models
Mateusz Pach
Shyamgopal Karthik
Quentin Bouniot
Serge Belongie
Zeynep Akata
VLM
62
0
0
03 Apr 2025
STING-BEE: Towards Vision-Language Model for Real-World X-ray Baggage Security Inspection
Divya Velayudhan
A. Ahmed
Mohamad Alansari
Neha Gour
Abderaouf Behouch
...
Muzammal Naseer
Juergen Gall
Mohammed Bennamoun
Ernesto Damiani
N. Werghi
42
0
0
03 Apr 2025
Leveraging Modality Tags for Enhanced Cross-Modal Video Retrieval
A. Fragomeni
Dima Damen
Michael Wray
33
0
0
02 Apr 2025
Slow-Fast Architecture for Video Multi-Modal Large Language Models
Min Shi
Shihao Wang
Chieh-Yun Chen
Jitesh Jain
Kai Wang
Junjun Xiong
Guilin Liu
Zhiding Yu
Humphrey Shi
31
1
0
02 Apr 2025