2102.10407
Cited By
VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning
Computer Vision and Pattern Recognition (CVPR), 2021
20 February 2021
Jun Chen
Han Guo
Kai Yi
Boyang Albert Li
Mohamed Elhoseiny
VLM
ArXiv (abs)
PDF
HTML
Github (331★)
Papers citing
"VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning"
50 / 165 papers shown
IWISDM: Assessing instruction following in multimodal models at scale
Xiaoxuan Lei
Lucas Gomez
Hao Yuan Bai
P. Bashivan
VLM
445
2
0
20 Jun 2024
Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt
Zonghao Ying
Aishan Liu
Tianyuan Zhang
Zhengmin Yu
Yaning Tan
Xianglong Liu
Dacheng Tao
AAML
389
77
0
06 Jun 2024
Can't make an Omelette without Breaking some Eggs: Plausible Action Anticipation using Large Video-Language Models
Himangi Mittal
Nakul Agarwal
Shao-Yuan Lo
Kwonjoon Lee
297
29
0
30 May 2024
Lateralization MLP: A Simple Brain-inspired Architecture for Diffusion
Zizhao Hu
Mohammad Rostami
231
0
0
25 May 2024
Adversarial Robustness for Visual Grounding of Multimodal Large Language Models
Kuofeng Gao
Yang Bai
Jiawang Bai
Yong Yang
Shu-Tao Xia
AAML
245
25
0
16 May 2024
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Chameleon Team
MLLM
586
634
0
16 May 2024
Learning Object States from Actions via Large Language Models
Masatoshi Tateno
Takuma Yagi
Ryosuke Furuta
Yoichi Sato
136
2
0
02 May 2024
ViOCRVQA: Novel Benchmark Dataset and Vision Reader for Visual Question Answering by Understanding Vietnamese Text in Images
Huy Quang Pham
Thang Kien-Bao Nguyen
Quan Van Nguyen
Dan Quang Tran
Nghia Hieu Nguyen
Kiet Van Nguyen
Ngan Luu-Thuy Nguyen
225
10
0
29 Apr 2024
Energy-Latency Manipulation of Multi-modal Large Language Models via Verbose Samples
Kuofeng Gao
Jindong Gu
Yang Bai
Shu-Tao Xia
Juil Sock
Wei Liu
Zhifeng Li
339
17
0
25 Apr 2024
Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs
Davide Caffagni
Federico Cocchi
Nicholas Moratelli
Sara Sarto
Marcella Cornia
Lorenzo Baraldi
Rita Cucchiara
KELM
382
75
0
23 Apr 2024
Evolving Interpretable Visual Classifiers with Large Language Models
Mia Chiquier
Utkarsh Mall
Carl Vondrick
VLM
254
20
0
15 Apr 2024
What Are We Measuring When We Evaluate Large Vision-Language Models? An Analysis of Latent Factors and Biases
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
A. M. H. Tiong
Junqi Zhao
Boyang Albert Li
Junnan Li
Guosheng Lin
Caiming Xiong
255
12
0
03 Apr 2024
Generative Multi-modal Models are Good Class-Incremental Learners
Xusheng Cao
Haori Lu
Linlan Huang
Xialei Liu
Ming-Ming Cheng
CLL
314
26
0
27 Mar 2024
EXAMS-V: A Multi-Discipline Multilingual Multimodal Exam Benchmark for Evaluating Vision Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Rocktim Jyoti Das
Simeon Emilov Hristov
Jinyan Su
Dimitar Iliyanov Dimitrov
Ivan Koychev
Preslav Nakov
CoGe
ELM
260
43
0
15 Mar 2024
CoIN: A Benchmark of Continual Instruction tuNing for Multimodel Large Language Model
Neural Information Processing Systems (NeurIPS), 2024
Cheng Chen
Sitong Su
Xu Luo
Hengtao Shen
Lianli Gao
Jingkuan Song
CLL
197
32
0
13 Mar 2024
Popeye: A Unified Visual-Language Model for Multi-Source Ship Detection from Remote Sensing Imagery
Wei Zhang
Miaoxin Cai
Tong Zhang
Guoqiang Lei
Zhuang Yin
Xuerui Mao
211
15
0
06 Mar 2024
Attention Guidance Mechanism for Handwritten Mathematical Expression Recognition
Yutian Liu
Wenjun Ke
Jianguo Wei
297
1
0
04 Mar 2024
Retrieval-Augmented Generation for AI-Generated Content: A Survey
Penghao Zhao
Hailin Zhang
Qinhan Yu
Zhengren Wang
Yunteng Geng
Fangcheng Fu
Ling Yang
Wentao Zhang
Jie Jiang
Tengjiao Wang
3DV
958
454
0
29 Feb 2024
ToolNet: Connecting Large Language Models with Massive Tools via Tool Graph
Xukun Liu
Zhiyuan Peng
Xiaoyuan Yi
Xing Xie
Lirong Xiang
Yuchen Liu
Dongkuan Xu
CLL
LLMAG
172
45
0
29 Feb 2024
From Summary to Action: Enhancing Large Language Models for Complex Tasks with Open World APIs
Yulong Liu
Yunlong Yuan
Chunwei Wang
Jianhua Han
Yongqiang Ma
Li Zhang
Nanning Zheng
Hang Xu
LLMAG
141
11
0
28 Feb 2024
Visual Hallucinations of Multi-modal Large Language Models
Wen Huang
Hongbin Liu
Minxin Guo
Neil Zhenqiang Gong
MLLM
VLM
286
59
0
22 Feb 2024
LVCHAT: Facilitating Long Video Comprehension
Yu Wang
Zeyuan Zhang
Julian McAuley
Zexue He
VLM
151
6
0
19 Feb 2024
Describing Images Fast and Slow: Quantifying and Predicting the Variation in Human Signals during Visuo-Linguistic Processes
Ece Takmaz
Sandro Pezzelle
Raquel Fernández
133
1
0
02 Feb 2024
MouSi: Poly-Visual-Expert Vision-Language Models
Xiaoran Fan
Changzhi Sun
Changhao Jiang
Shuo Li
Senjie Jin
...
Tao Gui
Xipeng Qiu
Xuanjing Huang
Zuxuan Wu
Yunchun Jiang
VLM
159
24
0
30 Jan 2024
EarthGPT: A Universal Multi-modal Large Language Model for Multi-sensor Image Comprehension in Remote Sensing Domain
Wei Zhang
Miaoxin Cai
Tong Zhang
Zhuang Yin
Xuerui Mao
433
214
0
30 Jan 2024
BETA: Binarized Energy-Efficient Transformer Accelerator at the Edge
International Symposium on Circuits and Systems (ISCAS), 2024
Yuhao Ji
Chao Fang
Zhongfeng Wang
239
8
0
22 Jan 2024
Inducing High Energy-Latency of Large Vision-Language Models with Verbose Images
International Conference on Learning Representations (ICLR), 2024
Kuofeng Gao
Yang Bai
Jindong Gu
Shu-Tao Xia
Juil Sock
Zhifeng Li
Wei Liu
VLM
216
65
0
20 Jan 2024
Veagle: Advancements in Multimodal Representation Learning
Rajat Chawla
Arkajit Datta
Tushar Verma
Adarsh Jha
Anmol Gautam
Ayush Vatsal
Sukrit Chaterjee
NS Mukunda
Ishaan Bhola
VLM
175
5
0
18 Jan 2024
Cross-Attention Watermarking of Large Language Models
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024
Folco Bertini Baldassini
H. Nguyen
Ching-Chung Chang
Isao Echizen
WaLM
140
4
0
12 Jan 2024
Pre-trained Model Guided Fine-Tuning for Zero-Shot Adversarial Robustness
Computer Vision and Pattern Recognition (CVPR), 2024
Sibo Wang
Jie Zhang
Zheng Yuan
Shiguang Shan
VLM
348
46
0
09 Jan 2024
Benchmarking PathCLIP for Pathology Image Analysis
Sunyi Zheng
Xiaonan Cui
Yuxuan Sun
Jingxiong Li
Honglin Li
Yunlong Zhang
Pingyi Chen
Xueping Jing
Zhaoxiang Ye
Lin Yang
VLM
179
14
0
05 Jan 2024
ChartBench: A Benchmark for Complex Visual Reasoning in Charts
Zhengzhuo Xu
Sinan Du
Yiyan Qi
Chengjin Xu
Chun Yuan
Jian Guo
430
86
0
26 Dec 2023
Depicting Beyond Scores: Advancing Image Quality Assessment through Multi-modal Language Models
European Conference on Computer Vision (ECCV), 2023
Zhiyuan You
Zheyuan Li
Jinjin Gu
Zhenfei Yin
Tianfan Xue
Chao Dong
EGVM
399
90
0
14 Dec 2023
Learning Hierarchical Prompt with Structured Linguistic Knowledge for Vision-Language Models
Yubin Wang
Xinyang Jiang
De Cheng
Dongsheng Li
Cairong Zhao
VLM
141
40
0
11 Dec 2023
Large Scale Foundation Models for Intelligent Manufacturing Applications: A Survey
Haotian Zhang
S. D. Semujju
Zhicheng Wang
Xianwei Lv
Kang Xu
...
Jing Wu
Zhuo Long
Zhicheng Wang
Xiaoguang Ma
Wensheng Liang
UQCV
AI4TS
AI4CE
364
25
0
11 Dec 2023
Causal-CoG: A Causal-Effect Look at Context Generation for Boosting Multi-modal Language Models
Computer Vision and Pattern Recognition (CVPR), 2023
Shitian Zhao
Zhuowan Li
Yadong Lu
Yaoyao Liu
Yan Wang
LRM
191
14
0
09 Dec 2023
InstructTA: Instruction-Tuned Targeted Attack for Large Vision-Language Models
Xunguang Wang
Zhenlan Ji
Pingchuan Ma
Zongjie Li
Shuai Wang
MLLM
317
19
0
04 Dec 2023
StoryGPT-V: Large Language Models as Consistent Story Visualizers
Computer Vision and Pattern Recognition (CVPR), 2023
Xiaoqian Shen
Mohamed Elhoseiny
VLM
446
20
0
04 Dec 2023
Bootstrapping Interactive Image-Text Alignment for Remote Sensing Image Captioning
IEEE Transactions on Geoscience and Remote Sensing (TGRS), 2023
Cong Yang
Zuchao Li
Lefei Zhang
163
61
0
02 Dec 2023
X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning
Artemis Panagopoulou
Le Xue
Ning Yu
Junnan Li
Dongxu Li
Shafiq Joty
Ran Xu
Silvio Savarese
Caiming Xiong
Juan Carlos Niebles
VLM
MLLM
276
69
0
30 Nov 2023
Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models
Computer Vision and Pattern Recognition (CVPR), 2023
Jiayun Luo
Siddhesh Khandelwal
Leonid Sigal
Boyang Albert Li
MLLM
VLM
631
12
0
28 Nov 2023
Vamos: Versatile Action Models for Video Understanding
European Conference on Computer Vision (ECCV), 2023
Shijie Wang
Qi Zhao
Minh Quan Do
Nakul Agarwal
Kwonjoon Lee
Chen Sun
389
36
0
22 Nov 2023
Violet: A Vision-Language Model for Arabic Image Captioning with Gemini Decoder
Abdelrahman Mohamed
Fakhraddin Alwajih
El Moatez Billah Nagoudi
Alcides Alcoba Inciarte
Muhammad Abdul-Mageed
VLM
MLLM
168
13
0
15 Nov 2023
SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models
Ziyi Lin
Chris Liu
Renrui Zhang
Shiyang Feng
Longtian Qiu
...
Siyuan Huang
Yichi Zhang
Xuming He
Jiaming Song
Yu Qiao
MLLM
VLM
310
275
0
13 Nov 2023
InfMLLM: A Unified Framework for Visual-Language Tasks
Qiang-feng Zhou
Zhibin Wang
Wei Chu
Yinghui Xu
Hao Li
Yuan Qi
MLLM
144
12
0
12 Nov 2023
LRM: Large Reconstruction Model for Single Image to 3D
Yicong Hong
Kai Zhang
Jiuxiang Gu
Sai Bi
Yang Zhou
Difan Liu
Feng Liu
Kalyan Sunkavalli
Trung Bui
Hao Tan
3DV
3DH
517
679
0
08 Nov 2023
Emotional Theory of Mind: Bridging Fast Visual Processing with Slow Linguistic Reasoning
Affective Computing and Intelligent Interaction (ACII), 2023
Yasaman Etesam
Özge Nilay Yalçin
Chuxuan Zhang
Angelica Lim
296
5
0
30 Oct 2023
InViG: Benchmarking Interactive Visual Grounding with 500K Human-Robot Interactions
Hanbo Zhang
Jie Xu
Yuchen Mo
Tao Kong
192
1
0
18 Oct 2023
MusicAgent: An AI Agent for Music Understanding and Generation with Large Language Models
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Dingyao Yu
Kaitao Song
Peiling Lu
Tianyu He
Xu Tan
Wei Ye
Shikun Zhang
Jiang Bian
LLMAG
329
25
0
18 Oct 2023
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning
Jun Chen
Deyao Zhu
Xiaoqian Shen
Xiang Li
Zechun Liu
Pengchuan Zhang
Raghuraman Krishnamoorthi
Vikas Chandra
Yunyang Xiong
Mohamed Elhoseiny
MLLM
1.4K
628
0
14 Oct 2023