Visual Instruction Tuning

17 April 2023
Haotian Liu
Chunyuan Li
Qingyang Wu
Yong Jae Lee
    SyDa
    VLM
    MLLM

Papers citing "Visual Instruction Tuning"

50 / 2,160 papers shown
Learning Manipulation by Predicting Interaction
Jia Zeng
Qingwen Bu
Bangjun Wang
Wenke Xia
Li Chen
...
Heming Cui
Bin Zhao
Xuelong Li
Yu Qiao
Hongyang Li
48
19
0
01 Jun 2024
Artemis: Towards Referential Understanding in Complex Videos
Jihao Qiu
Yuan Zhang
Xi Tang
Lingxi Xie
Tianren Ma
Pengyu Yan
David Doermann
Qixiang Ye
Yunjie Tian
VLM
VGen
37
8
0
01 Jun 2024
Evaluating Uncertainty-based Failure Detection for Closed-Loop LLM Planners
Zhi Zheng
Qian Feng
Hang Li
Alois C. Knoll
Jianxiang Feng
46
6
0
01 Jun 2024
Empowering Visual Creativity: A Vision-Language Assistant to Image Editing Recommendations
Tiancheng Shen
Jun Hao Liew
Long Mai
Lu Qi
Jiashi Feng
Jiaya Jia
DiffM
30
1
0
31 May 2024
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Chaoyou Fu
Yuhan Dai
Yongdong Luo
Lei Li
Shuhuai Ren
...
Tong Bill Xu
Xiawu Zheng
Enhong Chen
Rongrong Ji
Xing Sun
VLM
MLLM
48
297
0
31 May 2024
Kaleido Diffusion: Improving Conditional Diffusion Models with Autoregressive Latent Modeling
Jiatao Gu
Ying Shen
Shuangfei Zhai
Yizhe Zhang
Navdeep Jaitly
J. Susskind
42
10
0
31 May 2024
StrucTexTv3: An Efficient Vision-Language Model for Text-rich Image Perception, Comprehension, and Beyond
Pengyuan Lyu
Yulin Li
Hao Zhou
Weihong Ma
Xingyu Wan
...
Liang Wu
Chengquan Zhang
Kun Yao
Errui Ding
Jingdong Wang
36
7
0
31 May 2024
DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models
Linli Yao
Lei Li
Shuhuai Ren
Lean Wang
Yuanxin Liu
Xu Sun
Lu Hou
35
28
0
31 May 2024
OR-Bench: An Over-Refusal Benchmark for Large Language Models
Justin Cui
Wei-Lin Chiang
Ion Stoica
Cho-Jui Hsieh
ALM
38
33
0
31 May 2024
Retrieval Meets Reasoning: Even High-school Textbook Knowledge Benefits Multimodal Reasoning
Cheng Tan
Jingxuan Wei
Linzhuang Sun
Zhangyang Gao
Siyuan Li
Bihui Yu
Ruifeng Guo
Stan Z. Li
ReLM
LRM
3DV
64
6
0
31 May 2024
Ovis: Structural Embedding Alignment for Multimodal Large Language Model
Shiyin Lu
Yang Li
Qing-Guo Chen
Zhao Xu
Weihua Luo
Kaifu Zhang
Han-Jia Ye
VLM
MLLM
53
35
0
31 May 2024
Joint Embeddings for Graph Instruction Tuning
Vlad Argatu
Aaron Haag
Oliver Lohse
36
0
0
31 May 2024
Shotluck Holmes: A Family of Efficient Small-Scale Large Language Vision Models For Video Captioning and Summarization
Richard Luo
Austin Peng
Adithya Vasudev
Rishabh Jain
34
2
0
31 May 2024
Vision-Language Meets the Skeleton: Progressively Distillation with Cross-Modal Knowledge for 3D Action Representation Learning
Yang Chen
Tian He
Junfeng Fu
Ling Wang
Jingcai Guo
Hong Cheng
VLM
26
2
0
31 May 2024
LCQ: Low-Rank Codebook based Quantization for Large Language Models
Wen-Pu Cai
Wu-Jun Li
MQ
30
0
0
31 May 2024
MotionLLM: Understanding Human Behaviors from Human Motions and Videos
Ling-Hao Chen
Shunlin Lu
Ailing Zeng
Hao Zhang
Benyou Wang
Ruimao Zhang
Lei Zhang
45
34
0
30 May 2024
Visual Perception by Large Language Model's Weights
Feipeng Ma
Hongwei Xue
Guangting Wang
Yizhou Zhou
Fengyun Rao
Shilin Yan
Yueyi Zhang
Siying Wu
Mike Zheng Shou
Xiaoyan Sun
VLM
25
5
0
30 May 2024
Typography Leads Semantic Diversifying: Amplifying Adversarial Transferability across Multimodal Large Language Models
Hao-Ran Cheng
Erjia Xiao
Jiahang Cao
Le Yang
Kaidi Xu
Jindong Gu
Renjing Xu
AAML
55
7
0
30 May 2024
NoiseBoost: Alleviating Hallucination with Noise Perturbation for Multimodal Large Language Models
Kai Wu
Boyuan Jiang
Zhengkai Jiang
Qingdong He
Donghao Luo
Shengzhi Wang
Qingwen Liu
Chengjie Wang
VLM
MLLM
30
3
0
30 May 2024
Efficient LLM-Jailbreaking by Introducing Visual Modality
Zhenxing Niu
Yuyao Sun
Haodong Ren
Haoxuan Ji
Quan Wang
Xiaoke Ma
Gang Hua
Rong Jin
33
0
0
30 May 2024
Instruction-Guided Visual Masking
Jinliang Zheng
Jianxiong Li
Si Cheng
Yinan Zheng
Jiaming Li
Jihao Liu
Yu Liu
Jingjing Liu
Xianyuan Zhan
45
5
0
30 May 2024
Streaming Video Diffusion: Online Video Editing with Diffusion Models
Feng Chen
Zhen Yang
Bohan Zhuang
Qi Wu
DiffM
41
3
0
30 May 2024
Enhancing Large Vision Language Models with Self-Training on Image Comprehension
Yihe Deng
Pan Lu
Fan Yin
Ziniu Hu
Sheng Shen
James Y. Zou
Kai-Wei Chang
Wei Wang
SyDa
VLM
LRM
36
36
0
30 May 2024
Bridging Model-Based Optimization and Generative Modeling via Conservative Fine-Tuning of Diffusion Models
Masatoshi Uehara
Yulai Zhao
Ehsan Hajiramezanali
Gabriele Scalia
Gökçen Eraslan
Avantika Lal
Sergey Levine
Tommaso Biancalani
45
13
0
30 May 2024
SAM-E: Leveraging Visual Foundation Model with Sequence Imitation for Embodied Manipulation
Junjie Zhang
Chenjia Bai
Haoran He
Wenke Xia
Zhigang Wang
Bin Zhao
Xiu Li
Xuelong Li
35
12
0
30 May 2024
Source Code Foundation Models are Transferable Binary Analysis Knowledge Bases
Zian Su
Xiangzhe Xu
Ziyang Huang
Kaiyuan Zhang
Xiangyu Zhang
32
5
0
30 May 2024
Uncovering Bias in Large Vision-Language Models at Scale with Counterfactuals
Phillip Howard
Kathleen C. Fraser
Anahita Bhiwandiwalla
S. Kiritchenko
48
9
0
30 May 2024
Dr-LLaVA: Visual Instruction Tuning with Symbolic Clinical Grounding
Shenghuan Sun
Gregory M. Goldgof
Alexander Schubert
Zhiqing Sun
Thomas Hartvigsen
A. Butte
Ahmed Alaa
LM&MA
27
4
0
29 May 2024
X-VILA: Cross-Modality Alignment for Large Language Model
Hanrong Ye
De-An Huang
Yao Lu
Zhiding Yu
Wei Ping
...
Jan Kautz
Song Han
Dan Xu
Pavlo Molchanov
Hongxu Yin
MLLM
VLM
40
29
0
29 May 2024
Multi-Modal Generative Embedding Model
Feipeng Ma
Hongwei Xue
Guangting Wang
Yizhou Zhou
Fengyun Rao
Shilin Yan
Yueyi Zhang
Siying Wu
Mike Zheng Shou
Xiaoyan Sun
VLM
26
3
0
29 May 2024
Reasoning3D -- Grounding and Reasoning in 3D: Fine-Grained Zero-Shot Open-Vocabulary 3D Reasoning Part Segmentation via Large Vision-Language Models
Tianrun Chen
Chunan Yu
Jing Li
Jianqi Zhang
Lanyun Zhu
Deyi Ji
Yong Zhang
Ying-Dong Zang
Zejian Li
Lingyun Sun
LRM
41
9
0
29 May 2024
Adaptive Image Quality Assessment via Teaching Large Multimodal Model to Compare
Hanwei Zhu
Haoning Wu
Yixuan Li
Zicheng Zhang
Baoliang Chen
Lingyu Zhu
Yuming Fang
Guangtao Zhai
Weisi Lin
Shiqi Wang
38
18
0
29 May 2024
Voice Jailbreak Attacks Against GPT-4o
Xinyue Shen
Yixin Wu
Michael Backes
Yang Zhang
AuLLM
34
9
0
29 May 2024
Benchmarking and Improving Detail Image Caption
Hongyuan Dong
Jiawen Li
Bohong Wu
Jiacong Wang
Yuan Zhang
Haoyuan Guo
VLM
MLLM
35
16
0
29 May 2024
Enhancing Vision-Language Model with Unmasked Token Alignment
Jihao Liu
Jinliang Zheng
Boxiao Liu
Yu Liu
Hongsheng Li
CLIP
24
0
0
29 May 2024
Kestrel: Point Grounding Multimodal LLM for Part-Aware 3D Vision-Language Understanding
Junjie Fei
Mahmoud Ahmed
Jian Ding
Eslam Mohamed Bakr
Mohamed Elhoseiny
31
3
0
29 May 2024
Empowering Embodied Manipulation: A Bimanual-Mobile Robot Manipulation Dataset for Household Tasks
Tianle Zhang
Dongjiang Li
Yihang Li
Zecui Zeng
Lin Zhao
...
Yue Chen
Xuelong Wei
Yibing Zhan
Lusong Li
Xiaodong He
22
7
0
29 May 2024
Descriptive Image Quality Assessment in the Wild
Zhiyuan You
Jinjin Gu
Zheyuan Li
Xin Cai
Kaiwen Zhu
Chao Dong
Tianfan Xue
EGVM
40
16
0
29 May 2024
Reverse Image Retrieval Cues Parametric Memory in Multimodal LLMs
Jialiang Xu
Michael Moor
J. Leskovec
27
2
0
29 May 2024
SketchDeco: Decorating B&W Sketches with Colour
Chaitat Utintu
Pinaki Nath Chowdhury
Aneeshan Sain
Subhadeep Koley
A. Bhunia
Yi-Zhe Song
DiffM
34
3
0
29 May 2024
I See You: Teacher Analytics with GPT-4 Vision-Powered Observational Assessment
Unggi Lee
Yeil Jeong
Junbo Koh
Gyuri Byun
Yunseo Lee
Hyunwoong Lee
Seunmin Eun
Jewoong Moon
Cheolil Lim
Hyeoncheol Kim
9
2
0
28 May 2024
ViG: Linear-complexity Visual Sequence Learning with Gated Linear Attention
Bencheng Liao
Xinggang Wang
Lianghui Zhu
Qian Zhang
Chang Huang
45
4
0
28 May 2024
Why are Visually-Grounded Language Models Bad at Image Classification?
Yuhui Zhang
Alyssa Unell
Xiaohan Wang
Dhruba Ghosh
Yuchang Su
Ludwig Schmidt
Serena Yeung-Levy
VLM
35
27
0
28 May 2024
Instruct-MusicGen: Unlocking Text-to-Music Editing for Music Language Models via Instruction Tuning
Yixiao Zhang
Yukara Ikemiya
Woosung Choi
Naoki Murata
Marco A. Martínez Ramírez
Liwei Lin
Gus Xia
Wei-Hsiang Liao
Yuki Mitsufuji
Simon Dixon
55
10
0
28 May 2024
Is a 3D-Tokenized LLM the Key to Reliable Autonomous Driving?
Yifan Bai
Dongming Wu
Yingfei Liu
Fan Jia
Weixin Mao
...
Yucheng Zhao
Jianbing Shen
Xing Wei
Tiancai Wang
Xiangyu Zhang
MLLM
27
9
0
28 May 2024
Multi-modal Generation via Cross-Modal In-Context Learning
Amandeep Kumar
Muzammal Naseer
Sanath Narayan
Rao Muhammad Anwer
Salman Khan
Hisham Cholakkal
MLLM
51
0
0
28 May 2024
Intent3D: 3D Object Detection in RGB-D Scans Based on Human Intention
Weitai Kang
Mengxue Qu
Jyoti Kini
Yunchao Wei
Mubarak Shah
Yan Yan
LM&Ro
3DPC
45
10
0
28 May 2024
The Evolution of Multimodal Model Architectures
S. Wadekar
Abhishek Chaurasia
Aman Chadha
Eugenio Culurciello
41
14
0
28 May 2024
White-box Multimodal Jailbreaks Against Large Vision-Language Models
Ruofan Wang
Xingjun Ma
Hanxu Zhou
Chuanjun Ji
Guangnan Ye
Yu-Gang Jiang
AAML
VLM
41
17
0
28 May 2024
Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment
Xin Xiao
Bohong Wu
Jiacong Wang
Chunyuan Li
Xun Zhou
Haoyuan Guo
VLM
34
7
0
28 May 2024