Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2304.08485
Cited By
Visual Instruction Tuning
17 April 2023
Haotian Liu
Chunyuan Li
Qingyang Wu
Yong Jae Lee
SyDa
VLM
MLLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Visual Instruction Tuning"
50 / 3,278 papers shown
Title
Enhancing Video Transformers for Action Understanding with VLM-aided Training
Hui Lu
Hu Jian
Ronald Poppe
A. A. Salah
44
1
0
24 Mar 2024
Explore until Confident: Efficient Exploration for Embodied Question Answering
Allen Z. Ren
Jaden Clark
Anushri Dixit
Masha Itkina
Anirudha Majumdar
Dorsa Sadigh
47
29
0
23 Mar 2024
Hear Me, See Me, Understand Me: Audio-Visual Autism Behavior Recognition
Shijian Deng
Erin E. Kosloski
Siddhi Patel
Zeke A. Barnett
Yiyang Nan
...
William T. Doan
Matthew Wang
Harsh Singh
P. Rollins
Yapeng Tian
39
4
0
22 Mar 2024
CoNVOI: Context-aware Navigation using Vision Language Models in Outdoor and Indoor Environments
A. Sathyamoorthy
K. Weerakoon
Mohamed Bashir Elnoor
Anuj Zore
Brian Ichter
Fei Xia
Jie Tan
Wenhao Yu
Dinesh Manocha
LM&Ro
63
17
0
22 Mar 2024
LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models
Yuzhang Shang
Mu Cai
Bingxin Xu
Yong Jae Lee
Yan Yan
VLM
55
107
0
22 Mar 2024
Neural Plasticity-Inspired Multimodal Foundation Model for Earth Observation
Zhitong Xiong
Yi Wang
Fahong Zhang
Adam J. Stewart
Joelle Hanna
Damian Borth
Ioannis Papoutsis
B. L. Saux
Gustau Camps-Valls
Xiao Xiang Zhu
AI4CE
81
14
0
22 Mar 2024
Selectively Informative Description can Reduce Undesired Embedding Entanglements in Text-to-Image Personalization
Jimyeong Kim
Jungwon Park
Wonjong Rhee
DiffM
38
5
0
22 Mar 2024
Not All Attention is Needed: Parameter and Computation Efficient Transfer Learning for Multi-modal Large Language Models
Qiong Wu
Weihao Ye
Yiyi Zhou
Xiaoshuai Sun
Rongrong Ji
MoE
52
1
0
22 Mar 2024
MSCoTDet: Language-driven Multi-modal Fusion for Improved Multispectral Pedestrian Detection
Taeheon Kim
Sangyun Chung
Damin Yeom
Youngjoon Yu
Hak Gu Kim
Y. Ro
43
2
0
22 Mar 2024
Surgical-LVLM: Learning to Adapt Large Vision-Language Model for Grounded Visual Question Answering in Robotic Surgery
Guan-Feng Wang
Long Bai
Wan Jun Nah
Jie Wang
Zhaoxi Zhang
Zhen Chen
Jinlin Wu
Mobarakol Islam
Hongbin Liu
Hongliang Ren
51
14
0
22 Mar 2024
VidLA: Video-Language Alignment at Scale
Mamshad Nayeem Rizve
Fan Fei
Jayakrishnan Unnikrishnan
Son Tran
Benjamin Z. Yao
Belinda Zeng
Mubarak Shah
Trishul Chilimbi
VLM
AI4TS
63
4
0
21 Mar 2024
Can 3D Vision-Language Models Truly Understand Natural Language?
Weipeng Deng
Jihan Yang
Runyu Ding
Jiahui Liu
Yijiang Li
Xiaojuan Qi
Edith C.H. Ngai
47
4
0
21 Mar 2024
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
Renrui Zhang
Dongzhi Jiang
Yichi Zhang
Haokun Lin
Ziyu Guo
...
Aojun Zhou
Pan Lu
Kai-Wei Chang
Peng Gao
Hongsheng Li
34
173
0
21 Mar 2024
Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey
Zeyu Han
Chao Gao
Jinyang Liu
Jeff Zhang
Sai Qian Zhang
150
319
0
21 Mar 2024
MyVLM: Personalizing VLMs for User-Specific Queries
Yuval Alaluf
Elad Richardson
Sergey Tulyakov
Kfir Aberman
Daniel Cohen-Or
MLLM
VLM
43
18
0
21 Mar 2024
PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model
Zheng Zhang
Yeyao Ma
Enming Zhang
Xiang Bai
VLM
MLLM
42
32
0
21 Mar 2024
Pensieve: Retrospect-then-Compare Mitigates Visual Hallucination
Dingchen Yang
Bowen Cao
Guang Chen
Changjun Jiang
56
7
0
21 Mar 2024
Empowering Segmentation Ability to Multi-modal Large Language Models
Yuqi Yang
Peng-Tao Jiang
Jing Wang
Hao Zhang
Kai Zhao
Jinwei Chen
Yue Liu
LRM
VLM
35
3
0
21 Mar 2024
Multi-Modal Hallucination Control by Visual Information Grounding
Alessandro Favero
L. Zancato
Matthew Trager
Siddharth Choudhary
Pramuditha Perera
Alessandro Achille
Ashwin Swaminathan
Stefano Soatto
MLLM
90
63
0
20 Mar 2024
Improved Baselines for Data-efficient Perceptual Augmentation of LLMs
Théophane Vallaeys
Mustafa Shukor
Matthieu Cord
Jakob Verbeek
59
12
0
20 Mar 2024
Inserting Faces inside Captions: Image Captioning with Attention Guided Merging
Yannis Tevissen
Khalil Guetari
Marine Tassel
Erwan Kerleroux
Frédéric Petitpont
48
0
0
20 Mar 2024
Mora: Enabling Generalist Video Generation via A Multi-Agent Framework
Zhengqing Yuan
Ruoxi Chen
Zhaoxu Li
Haolong Jia
Lifang He
Chi Wang
Lichao Sun
VGen
68
27
0
20 Mar 2024
Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models
Zuyan Liu
Yuhao Dong
Yongming Rao
Jie Zhou
Jiwen Lu
LRM
27
13
0
19 Mar 2024
When Do We Not Need Larger Vision Models?
Baifeng Shi
Ziyang Wu
Maolin Mao
Xin Wang
Trevor Darrell
VLM
LRM
59
42
0
19 Mar 2024
mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding
Anwen Hu
Haiyang Xu
Jiabo Ye
Mingshi Yan
Liang Zhang
...
Chen Li
Ji Zhang
Qin Jin
Fei Huang
Jingren Zhou
VLM
49
106
0
19 Mar 2024
HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning
Fucai Ke
Zhixi Cai
Simindokht Jahangard
Weiqing Wang
P. D. Haghighi
Hamid Rezatofighi
LRM
56
10
0
19 Mar 2024
VisualCritic: Making LMMs Perceive Visual Quality Like Humans
Zhipeng Huang
Zhizheng Zhang
Yiting Lu
Zheng-Jun Zha
Zhibo Chen
Baining Guo
MLLM
60
12
0
19 Mar 2024
RelationVLM: Making Large Vision-Language Models Understand Visual Relations
Zhipeng Huang
Zhizheng Zhang
Zheng-Jun Zha
Yan Lu
Baining Guo
VLM
44
3
0
19 Mar 2024
Towards Multimodal In-Context Learning for Vision & Language Models
Sivan Doveh
Shaked Perek
M. Jehanzeb Mirza
Wei Lin
Amit Alfassy
Assaf Arbelle
S. Ullman
Leonid Karlinsky
VLM
114
14
0
19 Mar 2024
As Firm As Their Foundations: Can open-sourced foundation models be used to create adversarial examples for downstream tasks?
Anjun Hu
Jindong Gu
Francesco Pinto
Konstantinos Kamnitsas
Philip Torr
AAML
SILM
45
5
0
19 Mar 2024
WoLF: Wide-scope Large Language Model Framework for CXR Understanding
Seil Kang
Donghyun Kim
Junhyeok Kim
Hyo Kyung Lee
Seong Jae Hwang
51
2
0
19 Mar 2024
VisionGPT: LLM-Assisted Real-Time Anomaly Detection for Safe Visual Navigation
Hao Wang
Jiayou Qin
Ashish Bastola
Xiwen Chen
John Suchanek
Zihao Gong
Abolfazl Razi
43
15
0
19 Mar 2024
VL-ICL Bench: The Devil in the Details of Multimodal In-Context Learning
Yongshuo Zong
Ondrej Bohdal
Timothy M. Hospedales
30
5
0
19 Mar 2024
MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control
Enshen Zhou
Yiran Qin
Zhen-fei Yin
Yuzhou Huang
Ruimao Zhang
Lu Sheng
Yu Qiao
Jing Shao
LM&Ro
AI4CE
50
34
0
18 Mar 2024
HOIDiffusion: Generating Realistic 3D Hand-Object Interaction Data
Mengqi Zhang
Yang Fu
Zheng Ding
Sifei Liu
Zhuowen Tu
Xiaolong Wang
46
17
0
18 Mar 2024
SpatialPIN: Enhancing Spatial Reasoning Capabilities of Vision-Language Models through Prompting and Interacting 3D Priors
Chenyang Ma
Kai Lu
Ta-Ying Cheng
Niki Trigoni
Andrew Markham
LRM
40
8
0
18 Mar 2024
Agent3D-Zero: An Agent for Zero-shot 3D Understanding
Sha Zhang
Di Huang
Jiajun Deng
Shixiang Tang
Wanli Ouyang
Tong He
Yanyong Zhang
VGen
46
14
0
18 Mar 2024
Dynamic Tuning Towards Parameter and Inference Efficiency for ViT Adaptation
Wangbo Zhao
Jiasheng Tang
Yizeng Han
Yibing Song
Kai Wang
Gao Huang
F. Wang
Yang You
45
11
0
18 Mar 2024
Meta-Prompting for Automating Zero-shot Visual Recognition with LLMs
M. Jehanzeb Mirza
Leonid Karlinsky
Wei Lin
Sivan Doveh
Jakub Micorek
Mateusz Koziñski
Hilde Kuhene
Horst Possegger
VLM
MLLM
47
13
0
18 Mar 2024
LSKNet: A Foundation Lightweight Backbone for Remote Sensing
Yuxuan Li
Xiang Li
Yimain Dai
Qibin Hou
Li Liu
Yongxiang Liu
Ming-Ming Cheng
Jian Yang
44
32
0
18 Mar 2024
LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images
Ruyi Xu
Yuan Yao
Zonghao Guo
Junbo Cui
Zanlin Ni
Chunjiang Ge
Tat-Seng Chua
Zhiyuan Liu
Maosong Sun
Gao Huang
VLM
MLLM
37
105
0
18 Mar 2024
Prioritized Semantic Learning for Zero-shot Instance Navigation
Xander Sun
Louis Lau
Hoyard Zhi
Ronghe Qiu
Junwei Liang
45
8
0
18 Mar 2024
Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters
Jiazuo Yu
Yunzhi Zhuge
Lu Zhang
Ping Hu
Dong Wang
Huchuan Lu
You He
VLM
KELM
CLL
OODD
124
71
0
18 Mar 2024
Visual Preference Inference: An Image Sequence-Based Preference Reasoning in Tabletop Object Manipulation
Joonhyung Lee
Sangbeom Park
Yongin Kwon
Jemin Lee
Minwook Ahn
Sungjoon Choi
34
0
0
18 Mar 2024
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding
Yue Fan
Xiaojian Ma
Rujie Wu
Yuntao Du
Jiaqi Li
Zhi Gao
Qing Li
VLM
LLMAG
51
57
0
18 Mar 2024
Scene-LLM: Extending Language Model for 3D Visual Understanding and Reasoning
Rao Fu
Jingyu Liu
Xilun Chen
Yixin Nie
Wenhan Xiong
LM&Ro
LRM
55
55
0
18 Mar 2024
SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant
Guohao Sun
Can Qin
Jiamian Wang
Zeyuan Chen
Ran Xu
Zhiqiang Tao
MLLM
VLM
LRM
42
9
0
17 Mar 2024
Training A Small Emotional Vision Language Model for Visual Art Comprehension
Jing Zhang
Liang Zheng
Meng Wang
Dan Guo
VLM
35
4
0
17 Mar 2024
LuoJiaHOG: A Hierarchy Oriented Geo-aware Image Caption Dataset for Remote Sensing Image-Text Retrival
Yuanxin Zhao
Mi Zhang
Bingnan Yang
Zhan Zhang
Jiaju Kang
Jianya Gong
35
2
0
16 Mar 2024
A Comprehensive Study of Multimodal Large Language Models for Image Quality Assessment
Tianhe Wu
Kede Ma
Jie Liang
Yujiu Yang
Lei Zhang
34
19
0
16 Mar 2024
Previous
1
2
3
...
47
48
49
...
64
65
66
Next