ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2304.08485
  4. Cited By
Visual Instruction Tuning

Visual Instruction Tuning

17 April 2023
Haotian Liu
Chunyuan Li
Qingyang Wu
Yong Jae Lee
    SyDa
    VLM
    MLLM
ArXivPDFHTML

Papers citing "Visual Instruction Tuning"

50 / 2,159 papers shown
Title
Don't Miss the Forest for the Trees: Attentional Vision Calibration for
  Large Vision Language Models
Don't Miss the Forest for the Trees: Attentional Vision Calibration for Large Vision Language Models
Sangmin Woo
Donguk Kim
Jaehyuk Jang
Yubin Choi
Changick Kim
38
12
0
28 May 2024
Visual Anchors Are Strong Information Aggregators For Multimodal Large
  Language Model
Visual Anchors Are Strong Information Aggregators For Multimodal Large Language Model
Haogeng Liu
Quanzeng You
Xiaotian Han
Yongfei Liu
Huaibo Huang
Ran He
Hongxia Yang
31
2
0
28 May 2024
Cross-Modal Safety Alignment: Is textual unlearning all you need?
Cross-Modal Safety Alignment: Is textual unlearning all you need?
Trishna Chakraborty
Erfan Shayegani
Zikui Cai
Nael B. Abu-Ghazaleh
M. Salman Asif
Yue Dong
A. Roy-Chowdhury
Chengyu Song
39
15
0
27 May 2024
Matryoshka Multimodal Models
Matryoshka Multimodal Models
Mu Cai
Jianwei Yang
Jianfeng Gao
Yong Jae Lee
VLM
39
25
0
27 May 2024
Does Diffusion Beat GAN in Image Super Resolution?
Does Diffusion Beat GAN in Image Super Resolution?
Denis Kuznedelev
Valerii Startsev
Daniil Shlenskii
Sergey Kastryulin
36
4
0
27 May 2024
Hawk: Learning to Understand Open-World Video Anomalies
Hawk: Learning to Understand Open-World Video Anomalies
Jiaqi Tang
Hao Lu
Ruizheng Wu
Xiaogang Xu
Ke Ma
Cheng Fang
Bin Guo
Jiangbo Lu
Qifeng Chen
Ying-Cong Chen
VLM
40
9
0
27 May 2024
On Mesa-Optimization in Autoregressively Trained Transformers: Emergence
  and Capability
On Mesa-Optimization in Autoregressively Trained Transformers: Emergence and Capability
Chenyu Zheng
Wei Huang
Rongzheng Wang
Guoqiang Wu
Jun Zhu
Chongxuan Li
34
1
0
27 May 2024
PromptFix: You Prompt and We Fix the Photo
PromptFix: You Prompt and We Fix the Photo
Yongsheng Yu
Ziyun Zeng
Hang Hua
Jianlong Fu
Jiebo Luo
MLLM
DiffM
VLM
33
20
0
27 May 2024
Reason3D: Searching and Reasoning 3D Segmentation via Large Language Model
Reason3D: Searching and Reasoning 3D Segmentation via Large Language Model
Kuan-Chih Huang
Xiangtai Li
Lu Qi
Shuicheng Yan
Ming-Hsuan Yang
LRM
68
9
0
27 May 2024
LARM: Large Auto-Regressive Model for Long-Horizon Embodied Intelligence
LARM: Large Auto-Regressive Model for Long-Horizon Embodied Intelligence
Zhuoling Li
Xiaogang Xu
Zhenhua Xu
Sernam Lim
Hengshuang Zhao
LM&Ro
40
2
0
27 May 2024
VoCoT: Unleashing Visually Grounded Multi-Step Reasoning in Large Multi-Modal Models
VoCoT: Unleashing Visually Grounded Multi-Step Reasoning in Large Multi-Modal Models
Zejun Li
Ruipu Luo
Jiwen Zhang
Minghui Qiu
Zhongyu Wei
Zhongyu Wei
LRM
MLLM
60
7
0
27 May 2024
Accelerating Transformers with Spectrum-Preserving Token Merging
Accelerating Transformers with Spectrum-Preserving Token Merging
Hoai-Chau Tran
D. M. Nguyen
Duy M. Nguyen
Trung Thanh Nguyen
Ngan Le
Pengtao Xie
Daniel Sonntag
James Y. Zou
Binh T. Nguyen
Mathias Niepert
32
8
0
25 May 2024
Prompt-Aware Adapter: Towards Learning Adaptive Visual Tokens for
  Multimodal Large Language Models
Prompt-Aware Adapter: Towards Learning Adaptive Visual Tokens for Multimodal Large Language Models
Yue Zhang
Hehe Fan
Yi Yang
43
3
0
24 May 2024
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision
  Models
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
Byung-Kwan Lee
Chae Won Kim
Beomchan Park
Yonghyun Ro
MLLM
LRM
25
17
0
24 May 2024
V-Zen: Efficient GUI Understanding and Precise Grounding With A Novel
  Multimodal LLM
V-Zen: Efficient GUI Understanding and Precise Grounding With A Novel Multimodal LLM
Abdur Rahman
Rajat Chawla
Muskaan Kumar
Arkajit Datta
Adarsh Jha
NS Mukunda
Ishaan Bhola
40
2
0
24 May 2024
Talk to Parallel LiDARs: A Human-LiDAR Interaction Method Based on 3D
  Visual Grounding
Talk to Parallel LiDARs: A Human-LiDAR Interaction Method Based on 3D Visual Grounding
Yuhang Liu
Boyi Sun
Guixu Zheng
Yishuo Wang
Jing Wang
Fei-Yue Wang
34
2
0
24 May 2024
Alleviating Hallucinations in Large Vision-Language Models through Hallucination-Induced Optimization
Alleviating Hallucinations in Large Vision-Language Models through Hallucination-Induced Optimization
Beitao Chen
Xinyu Lyu
Lianli Gao
Jingkuan Song
Hengtao Shen
MLLM
54
10
0
24 May 2024
How Culturally Aware are Vision-Language Models?
How Culturally Aware are Vision-Language Models?
Olena Burda-Lassen
Aman Chadha
Shashank Goswami
Vinija Jain
VLM
34
0
0
24 May 2024
DEEM: Diffusion Models Serve as the Eyes of Large Language Models for Image Perception
DEEM: Diffusion Models Serve as the Eyes of Large Language Models for Image Perception
Run Luo
Yunshui Li
Longze Chen
Wanwei He
Ting-En Lin
...
Zikai Song
Xiaobo Xia
Tongliang Liu
Min Yang
Binyuan Hui
VLM
DiffM
70
14
0
24 May 2024
Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement
Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement
Xiyao Wang
Jiuhai Chen
Zhaoyang Wang
Yuhang Zhou
Yiyang Zhou
...
Tianyi Zhou
Tom Goldstein
Parminder Bhatia
Furong Huang
Cao Xiao
60
33
0
24 May 2024
Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models
Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models
Yongxin Guo
Zhenglin Cheng
Xiaoying Tang
Tao R. Lin
Tao Lin
MoE
53
7
0
23 May 2024
A Survey on Vision-Language-Action Models for Embodied AI
A Survey on Vision-Language-Action Models for Embodied AI
Yueen Ma
Zixing Song
Yuzheng Zhuang
Jianye Hao
Irwin King
LM&Ro
67
41
0
23 May 2024
LookHere: Vision Transformers with Directed Attention Generalize and
  Extrapolate
LookHere: Vision Transformers with Directed Attention Generalize and Extrapolate
A. Fuller
Daniel G. Kyrollos
Yousef Yassin
James R. Green
34
2
0
22 May 2024
Curriculum Direct Preference Optimization for Diffusion and Consistency Models
Curriculum Direct Preference Optimization for Diffusion and Consistency Models
Florinel-Alin Croitoru
Vlad Hondru
Radu Tudor Ionescu
N. Sebe
Mubarak Shah
EGVM
84
5
0
22 May 2024
Context-Enhanced Video Moment Retrieval with Large Language Models
Context-Enhanced Video Moment Retrieval with Large Language Models
Weijia Liu
Bo Miao
Jiuxin Cao
Xueling Zhu
Bo Liu
Mehwish Nasim
Ajmal Saeed Mian
29
2
0
21 May 2024
Single Image Unlearning: Efficient Machine Unlearning in Multimodal Large Language Models
Single Image Unlearning: Efficient Machine Unlearning in Multimodal Large Language Models
Jiaqi Li
Qianshan Wei
Chuanyi Zhang
Guilin Qi
Miaozeng Du
Yongrui Chen
Sheng Bi
Fan Liu
VLM
MU
67
12
0
21 May 2024
A Survey of Deep Learning-based Radiology Report Generation Using Multimodal Data
A Survey of Deep Learning-based Radiology Report Generation Using Multimodal Data
Xinyi Wang
Grazziela Figueredo
Ruizhe Li
W. Zhang
Weitong Chen
Xin Chen
MedIm
ViT
41
2
0
21 May 2024
MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering
MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering
Jingqun Tang
Qi Liu
Yongjie Ye
Jinghui Lu
Shubo Wei
...
Yanjie Wang
Yuliang Liu
Hao Liu
Xiang Bai
Can Huang
34
22
0
20 May 2024
TinyLLaVA Factory: A Modularized Codebase for Small-scale Large
  Multimodal Models
TinyLLaVA Factory: A Modularized Codebase for Small-scale Large Multimodal Models
Junlong Jia
Ying Hu
Xi Weng
Yiming Shi
Miao Li
...
Baichuan Zhou
Ziyu Liu
Jie Luo
Lei Huang
Ji Wu
30
9
0
20 May 2024
Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts
Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts
Yunxin Li
Shenyuan Jiang
Baotian Hu
Longyue Wang
Wanqi Zhong
Wenhan Luo
Lin Ma
Min-Ling Zhang
MoE
34
28
0
18 May 2024
ReasonPix2Pix: Instruction Reasoning Dataset for Advanced Image Editing
ReasonPix2Pix: Instruction Reasoning Dataset for Advanced Image Editing
Ying Jin
Pengyang Ling
Xiao-wen Dong
Pan Zhang
Jiaqi Wang
Dahua Lin
24
2
0
18 May 2024
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Chameleon Team
MLLM
60
253
0
16 May 2024
Contextual Emotion Recognition using Large Vision Language Models
Contextual Emotion Recognition using Large Vision Language Models
Yasaman Etesam
Özge Nilay Yalçin
Chuxuan Zhang
Angelica Lim
VLM
69
3
0
14 May 2024
SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and
  Composition of Experts
SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts
R. Prabhakar
R. Sivaramakrishnan
Darshan Gandhi
Yun Du
Mingran Wang
...
Urmish Thakker
Dawei Huang
Sumti Jairath
Kevin J. Brown
K. Olukotun
MoE
39
12
0
13 May 2024
SuFIA: Language-Guided Augmented Dexterity for Robotic Surgical
  Assistants
SuFIA: Language-Guided Augmented Dexterity for Robotic Surgical Assistants
Masoud Moghani
Lars Doorenbos
Will Panitch
Sean Huver
Mahdi Azizian
Ken Goldberg
Animesh Garg
32
9
0
08 May 2024
THRONE: An Object-based Hallucination Benchmark for the Free-form Generations of Large Vision-Language Models
THRONE: An Object-based Hallucination Benchmark for the Free-form Generations of Large Vision-Language Models
Prannay Kaul
Zhizhong Li
Hao-Yu Yang
Yonatan Dukler
Ashwin Swaminathan
C. Taylor
Stefano Soatto
HILM
49
15
0
08 May 2024
Light-VQA+: A Video Quality Assessment Model for Exposure Correction
  with Vision-Language Guidance
Light-VQA+: A Video Quality Assessment Model for Exposure Correction with Vision-Language Guidance
Xunchu Zhou
Xiaohong Liu
Yunlong Dong
Tengchuan Kou
Yixuan Gao
Zicheng Zhang
Chunyi Li
Haoning Wu
Guangtao Zhai
33
3
0
06 May 2024
Pose Priors from Language Models
Pose Priors from Language Models
Sanjay Subramanian
Evonne Ng
Lea Muller
Dan Klein
Shiry Ginosar
Trevor Darrell
41
3
0
06 May 2024
Simplifying Multimodality: Unimodal Approach to Multimodal Challenges in
  Radiology with General-Domain Large Language Model
Simplifying Multimodality: Unimodal Approach to Multimodal Challenges in Radiology with General-Domain Large Language Model
Seonhee Cho
Choonghan Kim
Jiho Lee
Chetan Chilkunda
Sujin Choi
Joo Heung Yoon
44
0
0
29 Apr 2024
Do Vision & Language Decoders use Images and Text equally? How Self-consistent are their Explanations?
Do Vision & Language Decoders use Images and Text equally? How Self-consistent are their Explanations?
Letitia Parcalabescu
Anette Frank
MLLM
CoGe
VLM
82
3
0
29 Apr 2024
PromptCIR: Blind Compressed Image Restoration with Prompt Learning
PromptCIR: Blind Compressed Image Restoration with Prompt Learning
Bingchen Li
Xin Li
Yiting Lu
Ruoyu Feng
Mengxi Guo
Shijie Zhao
Li Zhang
Zhibo Chen
30
13
0
26 Apr 2024
List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs
List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs
An Yan
Zhengyuan Yang
Junda Wu
Wanrong Zhu
Jianwei Yang
...
K. Lin
Jianfeng Wang
Julian McAuley
Jianfeng Gao
Lijuan Wang
LRM
34
12
0
25 Apr 2024
An empirical study of LLaMA3 quantization: from LLMs to MLLMs
An empirical study of LLaMA3 quantization: from LLMs to MLLMs
Wei Huang
Xingyu Zheng
Xudong Ma
Haotong Qin
Chengtao Lv
Hong Chen
Jie Luo
Xiaojuan Qi
Xianglong Liu
Michele Magno
MQ
54
38
0
22 Apr 2024
Clio: Real-time Task-Driven Open-Set 3D Scene Graphs
Clio: Real-time Task-Driven Open-Set 3D Scene Graphs
Dominic Maggio
Yun Chang
Nathan Hughes
Matthew Trang
Dan Griffith
Carlyn Dougherty
Eric Cristofalo
Lukas Schmid
Luca Carlone
3DV
36
32
0
21 Apr 2024
Physical Backdoor Attack can Jeopardize Driving with
  Vision-Large-Language Models
Physical Backdoor Attack can Jeopardize Driving with Vision-Large-Language Models
Zhenyang Ni
Rui Ye
Yuxian Wei
Zhen Xiang
Yanfeng Wang
Siheng Chen
AAML
32
9
0
19 Apr 2024
TextSquare: Scaling up Text-Centric Visual Instruction Tuning
TextSquare: Scaling up Text-Centric Visual Instruction Tuning
Jingqun Tang
Chunhui Lin
Zhen Zhao
Shubo Wei
Binghong Wu
...
Yuliang Liu
Hao Liu
Yuan Xie
Xiang Bai
Can Huang
LRM
VLM
MLLM
64
28
0
19 Apr 2024
COMBO: Compositional World Models for Embodied Multi-Agent Cooperation
COMBO: Compositional World Models for Embodied Multi-Agent Cooperation
Hongxin Zhang
Zeyuan Wang
Qiushi Lyu
Zheyuan Zhang
Sunli Chen
Tianmin Shu
Yilun Du
Kwonjoon Lee
Yilun Du
Chuang Gan
41
12
0
16 Apr 2024
MMInA: Benchmarking Multihop Multimodal Internet Agents
MMInA: Benchmarking Multihop Multimodal Internet Agents
Ziniu Zhang
Shulin Tian
Liangyu Chen
Ziwei Liu
LLMAG
LM&Ro
27
13
0
15 Apr 2024
HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision
HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision
Siddhant Bansal
Michael Wray
Dima Damen
31
3
0
15 Apr 2024
Tri-modal Confluence with Temporal Dynamics for Scene Graph Generation
  in Operating Rooms
Tri-modal Confluence with Temporal Dynamics for Scene Graph Generation in Operating Rooms
Diandian Guo
Manxi Lin
Jialun Pei
He Tang
Yueming Jin
Pheng-Ann Heng
24
1
0
14 Apr 2024
Previous
123...394041424344
Next