ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2304.08485
  4. Cited By
Visual Instruction Tuning

Visual Instruction Tuning

17 April 2023
Haotian Liu
Chunyuan Li
Qingyang Wu
Yong Jae Lee
    SyDa
    VLM
    MLLM
ArXivPDFHTML

Papers citing "Visual Instruction Tuning"

50 / 2,162 papers shown
Title
ImageNet3D: Towards General-Purpose Object-Level 3D Understanding
ImageNet3D: Towards General-Purpose Object-Level 3D Understanding
Wufei Ma
Guanning Zeng
Guofeng Zhang
Qihao Liu
Letian Zhang
Adam Kortylewski
Yaoyao Liu
Alan Yuille
VLM
3DV
41
7
0
13 Jun 2024
VideoGPT+: Integrating Image and Video Encoders for Enhanced Video
  Understanding
VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding
Muhammad Maaz
H. Rasheed
Salman Khan
Fahad A Khan
VLM
MLLM
27
49
0
13 Jun 2024
Depth Anything V2
Depth Anything V2
Lihe Yang
Bingyi Kang
Zilong Huang
Zhen Zhao
Xiaogang Xu
Jiashi Feng
Hengshuang Zhao
DiffM
VLM
MDE
59
323
0
13 Jun 2024
Explore the Limits of Omni-modal Pretraining at Scale
Explore the Limits of Omni-modal Pretraining at Scale
Yiyuan Zhang
Handong Li
Jing Liu
Xiangyu Yue
VLM
LRM
45
1
0
13 Jun 2024
MuirBench: A Comprehensive Benchmark for Robust Multi-image
  Understanding
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding
Fei Wang
Xingyu Fu
James Y. Huang
Zekun Li
Qin Liu
...
Kai-Wei Chang
Dan Roth
Sheng Zhang
Hoifung Poon
Muhao Chen
VLM
36
47
0
13 Jun 2024
MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded
  Language Annotations
MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations
Ruiyuan Lyu
Tai Wang
Jingli Lin
Shuai Yang
Xiaohan Mao
...
Runsen Xu
Haifeng Huang
Chenming Zhu
Dahua Lin
Jiangmiao Pang
3DV
36
9
0
13 Jun 2024
Yo'LLaVA: Your Personalized Language and Vision Assistant
Yo'LLaVA: Your Personalized Language and Vision Assistant
Thao Nguyen
Haotian Liu
Yuheng Li
Mu Cai
Utkarsh Ojha
Yong Jae Lee
VLM
MLLM
51
15
0
13 Jun 2024
Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks
  and Algorithms
Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms
Miaosen Zhang
Yixuan Wei
Zhen Xing
Yifei Ma
Zuxuan Wu
...
Zheng-Wei Zhang
Qi Dai
Chong Luo
Xin Geng
Baining Guo
VLM
46
1
0
13 Jun 2024
Towards Vision-Language Geo-Foundation Model: A Survey
Towards Vision-Language Geo-Foundation Model: A Survey
Yue Zhou
Litong Feng
Yiping Ke
Xue Jiang
Junchi Yan
Xue Yang
Wayne Zhang
42
15
0
13 Jun 2024
AlignMMBench: Evaluating Chinese Multimodal Alignment in Large
  Vision-Language Models
AlignMMBench: Evaluating Chinese Multimodal Alignment in Large Vision-Language Models
Yuhang Wu
Wenmeng Yu
Yean Cheng
Yan Wang
Xiaohan Zhang
Jiazheng Xu
Ming Ding
Yuxiao Dong
48
1
0
13 Jun 2024
OpenVLA: An Open-Source Vision-Language-Action Model
OpenVLA: An Open-Source Vision-Language-Action Model
Moo Jin Kim
Karl Pertsch
Siddharth Karamcheti
Ted Xiao
Ashwin Balakrishna
...
Russ Tedrake
Dorsa Sadigh
Sergey Levine
Percy Liang
Chelsea Finn
LM&Ro
VLM
37
360
0
13 Jun 2024
Comparison Visual Instruction Tuning
Comparison Visual Instruction Tuning
Wei Lin
M. Jehanzeb Mirza
Sivan Doveh
Rogerio Feris
Raja Giryes
Sepp Hochreiter
Leonid Karlinsky
46
4
0
13 Jun 2024
ReMI: A Dataset for Reasoning with Multiple Images
ReMI: A Dataset for Reasoning with Multiple Images
Mehran Kazemi
Nishanth Dikkala
Ankit Anand
Petar Dević
Ishita Dasgupta
...
Bahare Fatemi
Pranjal Awasthi
Dee Guo
Sreenivas Gollapudi
Ahmed Qureshi
LRM
VLM
34
13
0
13 Jun 2024
INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance
  in Insurance
INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance
Chenwei Lin
Hanjia Lyu
Xian Xu
Jiebo Luo
30
1
0
13 Jun 2024
Delta-CoMe: Training-Free Delta-Compression with Mixed-Precision for
  Large Language Models
Delta-CoMe: Training-Free Delta-Compression with Mixed-Precision for Large Language Models
Bowen Ping
Shuo Wang
Hanqing Wang
Xu Han
Yuzhuang Xu
Yukun Yan
Yun Chen
Baobao Chang
Zhiyuan Liu
Maosong Sun
MQ
43
4
0
13 Jun 2024
Cognitively Inspired Energy-Based World Models
Cognitively Inspired Energy-Based World Models
Alexi Gladstone
Ganesh Nanduru
Md. Mofijul Islam
Aman Chadha
Jundong Li
Tariq Iqbal
31
0
0
13 Jun 2024
DubWise: Video-Guided Speech Duration Control in Multimodal LLM-based
  Text-to-Speech for Dubbing
DubWise: Video-Guided Speech Duration Control in Multimodal LLM-based Text-to-Speech for Dubbing
Neha Sahipjohn
Ashishkumar Gudmalwar
Nirmesh Shah
Pankaj Wasnik
R. Shah
37
5
0
13 Jun 2024
VLind-Bench: Measuring Language Priors in Large Vision-Language Models
VLind-Bench: Measuring Language Priors in Large Vision-Language Models
Kang-il Lee
Minbeom Kim
Seunghyun Yoon
Minsung Kim
Dongryeol Lee
Hyukhun Koh
Kyomin Jung
CoGe
VLM
84
5
0
13 Jun 2024
MMFakeBench: A Mixed-Source Multimodal Misinformation Detection Benchmark for LVLMs
MMFakeBench: A Mixed-Source Multimodal Misinformation Detection Benchmark for LVLMs
Xuannan Liu
Zekun Li
Peipei Li
Shuhan Xia
Xing Cui
Linzhi Huang
Huaibo Huang
Weihong Deng
Zhaofeng He
36
13
0
13 Jun 2024
Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA
Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA
Jongwoo Park
Kanchana Ranasinghe
Kumara Kahatapitiya
Wonjeong Ryoo
Donghyun Kim
Michael S. Ryoo
60
20
0
13 Jun 2024
Ad Auctions for LLMs via Retrieval Augmented Generation
Ad Auctions for LLMs via Retrieval Augmented Generation
Mohammadtaghi Hajiaghayi
Sébastien Lahaie
Keivan Rezaei
Suho Shin
38
6
0
12 Jun 2024
Advancing High Resolution Vision-Language Models in Biomedicine
Advancing High Resolution Vision-Language Models in Biomedicine
Zekai Chen
Arda Pekis
Kevin Brown
MedIm
LM&MA
16
4
0
12 Jun 2024
Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models
Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models
Yi-Fan Zhang
Qingsong Wen
Chaoyou Fu
Xue Wang
Zhang Zhang
L. Wang
Rong Jin
34
40
0
12 Jun 2024
What If We Recaption Billions of Web Images with LLaMA-3?
What If We Recaption Billions of Web Images with LLaMA-3?
Xianhang Li
Haoqin Tu
Mude Hui
Zeyu Wang
Bingchen Zhao
...
Jieru Mei
Qing Liu
Huangjie Zheng
Yuyin Zhou
Cihang Xie
VLM
MLLM
36
35
0
12 Jun 2024
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images
  Interleaved with Text
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
Qingyun Li
Zhe Chen
Weiyun Wang
Wenhai Wang
Shenglong Ye
...
Dahua Lin
Yu Qiao
Botian Shi
Conghui He
Jifeng Dai
VLM
OffRL
51
20
0
12 Jun 2024
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation
  in Videos
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos
Xuehai He
Weixi Feng
Kaizhi Zheng
Yujie Lu
Wanrong Zhu
...
Zhengyuan Yang
Kevin Lin
William Yang Wang
Lijuan Wang
Xin Eric Wang
VGen
LRM
36
12
0
12 Jun 2024
DDR: Exploiting Deep Degradation Response as Flexible Image Descriptor
DDR: Exploiting Deep Degradation Response as Flexible Image Descriptor
Juncheng Wu
Zhangkai Ni
Hanli Wang
Wenhan Yang
Yuyin Zhou
Shiqi Wang
33
1
0
12 Jun 2024
OpenCOLE: Towards Reproducible Automatic Graphic Design Generation
OpenCOLE: Towards Reproducible Automatic Graphic Design Generation
Naoto Inoue
Kento Masui
Wataru Shimoda
Kota Yamaguchi
24
8
0
12 Jun 2024
One-Step Effective Diffusion Network for Real-World Image
  Super-Resolution
One-Step Effective Diffusion Network for Real-World Image Super-Resolution
Rongyuan Wu
Lingchen Sun
Zhiyuan Ma
Lei Zhang
DiffM
28
51
0
12 Jun 2024
ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs
ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs
Irene Huang
Wei Lin
M. Jehanzeb Mirza
Jacob A. Hansen
Sivan Doveh
...
Trevor Darrel
Chuang Gan
Aude Oliva
Rogerio Feris
Leonid Karlinsky
CoGe
LRM
30
7
0
12 Jun 2024
Flash-VStream: Memory-Based Real-Time Understanding for Long Video
  Streams
Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams
Haoji Zhang
Yiqin Wang
Yansong Tang
Yong-Jin Liu
Jiashi Feng
Jifeng Dai
Xiaojie Jin
32
37
0
12 Jun 2024
A Concept-Based Explainability Framework for Large Multimodal Models
A Concept-Based Explainability Framework for Large Multimodal Models
Jayneel Parekh
Pegah Khayatan
Mustafa Shukor
A. Newson
Matthieu Cord
32
16
0
12 Jun 2024
Large Language Models Meet Text-Centric Multimodal Sentiment Analysis: A
  Survey
Large Language Models Meet Text-Centric Multimodal Sentiment Analysis: A Survey
Hao-Yu Yang
Yanyan Zhao
Yang Wu
Shilong Wang
Tian Zheng
Hongbo Zhang
Zongyang Ma
Wanxiang Che
Bing Qin
34
8
0
12 Jun 2024
Fewer Tokens and Fewer Videos: Extending Video Understanding Abilities
  in Large Vision-Language Models
Fewer Tokens and Fewer Videos: Extending Video Understanding Abilities in Large Vision-Language Models
Shimin Chen
Yitian Yuan
Shaoxiang Chen
Zequn Jie
Lin Ma
VLM
27
3
0
12 Jun 2024
ROADWork Dataset: Learning to Recognize, Observe, Analyze and Drive
  Through Work Zones
ROADWork Dataset: Learning to Recognize, Observe, Analyze and Drive Through Work Zones
Anurag Ghosh
R. Tamburo
Shen Zheng
Juan R. Alvarez-Padilla
Hailiang Zhu
Michael Cardei
Nicholas Dunn
Christoph Mertz
Srinivasa G. Narasimhan
39
1
0
11 Jun 2024
An Image is Worth 32 Tokens for Reconstruction and Generation
An Image is Worth 32 Tokens for Reconstruction and Generation
Qihang Yu
Mark Weber
XueQing Deng
Xiaohui Shen
Daniel Cremers
Liang-Chieh Chen
VLM
ViT
44
79
0
11 Jun 2024
Vision Model Pre-training on Interleaved Image-Text Data via Latent
  Compression Learning
Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning
Chenyu Yang
Xizhou Zhu
Jinguo Zhu
Weijie Su
Junjie Wang
...
Lewei Lu
Bin Li
Jie Zhou
Yu Qiao
Jifeng Dai
VLM
CLIP
39
4
0
11 Jun 2024
Cognitive Insights Across Languages: Enhancing Multimodal Interview
  Analysis
Cognitive Insights Across Languages: Enhancing Multimodal Interview Analysis
David Ortiz-Perez
José García Rodríguez
David Tomás
20
1
0
11 Jun 2024
Understanding Visual Concepts Across Models
Understanding Visual Concepts Across Models
Brandon Trabucco
Max Gurinas
Kyle Doherty
Ruslan Salakhutdinov
VLM
35
0
0
11 Jun 2024
MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal
  Large Language Models
MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models
Tianle Gu
Zeyang Zhou
Kexin Huang
Dandan Liang
Yixu Wang
...
Keqing Wang
Yujiu Yang
Yan Teng
Yu Qiao
Yingchun Wang
ELM
42
9
0
11 Jun 2024
Needle In A Multimodal Haystack
Needle In A Multimodal Haystack
Weiyun Wang
Shuibo Zhang
Yiming Ren
Yuchen Duan
Tiantong Li
...
Ping Luo
Yu Qiao
Jifeng Dai
Wenqi Shao
Wenhai Wang
VLM
57
17
0
11 Jun 2024
Can Foundation Models Reliably Identify Spatial Hazards? A Case Study on
  Curb Segmentation
Can Foundation Models Reliably Identify Spatial Hazards? A Case Study on Curb Segmentation
Diwei Sheng
Giles Hamilton-Fletcher
Mahya Beheshti
Chen Feng
John-Ross Rizzo
21
2
0
11 Jun 2024
FaceGPT: Self-supervised Learning to Chat about 3D Human Faces
FaceGPT: Self-supervised Learning to Chat about 3D Human Faces
Haoran Wang
Mohit Mendiratta
Christian Theobalt
Adam Kortylewski
3DH
CVBM
31
3
0
11 Jun 2024
RS-Agent: Automating Remote Sensing Tasks through Intelligent Agents
RS-Agent: Automating Remote Sensing Tasks through Intelligent Agents
Wenjia Xu
Zijian Yu
Yixu Wang
Jiuniu Wang
Mugen Peng
LLMAG
48
7
0
11 Jun 2024
FoodSky: A Food-oriented Large Language Model that Passes the Chef and
  Dietetic Examination
FoodSky: A Food-oriented Large Language Model that Passes the Chef and Dietetic Examination
Pengfei Zhou
Weiqing Min
Chaoran Fu
Ying Jin
Mingyu Huang
Xiangyang Li
Shuhuan Mei
Shuqiang Jiang
33
8
0
11 Jun 2024
Autoregressive Model Beats Diffusion: Llama for Scalable Image
  Generation
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
Peize Sun
Yi Jiang
Shoufa Chen
Shilong Zhang
Bingyue Peng
Ping Luo
Zehuan Yuan
VLM
60
220
0
10 Jun 2024
AID: Adapting Image2Video Diffusion Models for Instruction-guided Video
  Prediction
AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction
Zhen Xing
Qi Dai
Zejia Weng
Zuxuan Wu
Yu-Gang Jiang
VGen
39
14
0
10 Jun 2024
iMotion-LLM: Motion Prediction Instruction Tuning
iMotion-LLM: Motion Prediction Instruction Tuning
Abdulwahab Felemban
Eslam Mohamed Bakr
Xiaoqian Shen
Jian Ding
Abduallah A. Mohamed
Mohamed Elhoseiny
45
1
0
10 Jun 2024
Can I understand what I create? Self-Knowledge Evaluation of Large
  Language Models
Can I understand what I create? Self-Knowledge Evaluation of Large Language Models
Zhiquan Tan
Lai Wei
Jindong Wang
Xing Xie
Weiran Huang
ELM
LRM
35
5
0
10 Jun 2024
GAIA: Rethinking Action Quality Assessment for AI-Generated Videos
GAIA: Rethinking Action Quality Assessment for AI-Generated Videos
Zijian Chen
Wei Sun
Yuan Tian
Jun Jia
Zicheng Zhang
Jiarui Wang
Ru Huang
Xiongkuo Min
Guangtao Zhai
Wenjun Zhang
EGVM
48
10
0
10 Jun 2024
Previous
123...363738...424344
Next