VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning

Computer Vision and Pattern Recognition (CVPR), 2021
20 February 2021
Jun Chen
Han Guo
Kai Yi
Boyang Albert Li
Mohamed Elhoseiny
    VLM
ArXiv (abs), PDF, HTML, Github (331★)

Papers citing "VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning"

50 / 165 papers shown
Leveraging Textual Compositional Reasoning for Robust Change Captioning
Kyu Ri Park
Jiyoung Park
Seong Tae Kim
Hong Joo Lee
Jung Uk Kim
CoGe
113
0
0
28 Nov 2025
Co-Training Vision Language Models for Remote Sensing Multi-task Learning
Qingyun Li
Shuran Ma
Junwei Luo
Yi Yu
Yue Zhou
...
Xudong Lu
Xiaoxing Wang
Xin He
Yushi Chen
Xue Yang
179
1
0
26 Nov 2025
Multimodal Continual Instruction Tuning with Dynamic Gradient Guidance
Songze Li
Mingyu Gao
Tonghua Su
Xu-Yao Zhang
Zhongjie Wang
CLL
328
0
0
19 Nov 2025
Towards Fast LLM Fine-tuning through Zeroth-Order Optimization with Projected Gradient-Aligned Perturbations
Zhendong Mi
Qitao Tan
Grace Li Zhang
Zhaozhuo Xu
Geng Yuan
Shaoyi Huang
145
0
0
21 Oct 2025
Graph4MM: Weaving Multimodal Learning with Structural Information
Xuying Ning
Dongqi Fu
Tianxin Wei
Wujiang Xu
Jingrui He
118
4
0
19 Oct 2025
QSVD: Efficient Low-rank Approximation for Unified Query-Key-Value Weight Compression in Low-Precision Vision-Language Models
Yutong Wang
Haiyu Wang
Sai Qian Zhang
89
1
0
18 Oct 2025
A Framework for Generating Artificial Datasets to Validate Absolute and Relative Position Concepts
George Correa de Araujo
H. Maia
Hélio Pedrini
144
0
0
17 Sep 2025
Bridging Vision Language Models and Symbolic Grounding for Video Question Answering
Haodi Ma
Vyom Pathak
Daisy Zhe Wang
114
1
0
15 Sep 2025
Galaxea Open-World Dataset and G0 Dual-System VLA Model
Tao Jiang
Tianyuan Yuan
Yicheng Liu
Chenhao Lu
Jianning Cui
Xiao Liu
Shuiqi Cheng
Jiyang Gao
Huazhe Xu
Hang Zhao
LM&Ro
121
18
0
30 Aug 2025
VT-LVLM-AR: A Video-Temporal Large Vision-Language Model Adapter for Fine-Grained Action Recognition in Long-Term Videos
Kaining Li
Shuwei He
Zihan Xu
VLM
95
0
0
21 Aug 2025
WeatherPrompt: Multi-modality Representation Learning for All-Weather Drone Visual Geo-Localization
Jiahao Wen
Hang Yu
Zhedong Zheng
251
2
0
13 Aug 2025
PET2Rep: Towards Vision-Language Model-Drived Automated Radiology Report Generation for Positron Emission Tomography
Yichi Zhang
Wenbo Zhang
Zehui Ling
Gang Feng
Sisi Peng
...
Limei Han
Yuan Cheng
Zixin Hu
Yuan Qi
Le Xue
MedIm
LM&MA
146
2
0
06 Aug 2025
VLM4D: Towards Spatiotemporal Awareness in Vision Language Models
Shijie Zhou
Alexander Vilesov
Xuehai He
Ziyu Wan
Shuwang Zhang
Aditya Nagachandra
Di Chang
DongDong Chen
Xin Eric Wang
A. Kadambi
VLM
185
0
0
04 Aug 2025
ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents
Yilei Jiang
Y. Zheng
Yuxuan Wan
Jiaming Han
Qunzhong Wang
Michael R. Lyu
Xiangyu Yue
LLMAG
199
8
0
30 Jul 2025
Group Relative Augmentation for Data Efficient Action Detection
Deep Patel
Iain Melvin
Zachary Izzo
Martin Renqiang Min
VLM
163
0
0
28 Jul 2025
ViGText: Deepfake Image Detection with Vision-Language Model Explanations and Graph Neural Networks
Ahmad ALBarqawi
Mahmoud Nazzal
Issa M. Khalil
Abdallah Khreishah
Nhathai Phan
242
0
0
24 Jul 2025
Med-GRIM: Enhanced Zero-Shot Medical VQA using prompt-embedded Multimodal Graph RAG
Rakesh Raj Madavan
Akshat Kaimal
Hashim Faisal
Chandrakala Shanmuganathan
MedIm
122
1
0
20 Jul 2025
ElasticMM: Efficient Multimodal LLMs Serving with Elastic Multimodal Parallelism
Zedong Liu
Shenggan Cheng
Guangming Tan
Yang You
Dingwen Tao
544
3
0
14 Jul 2025
Enabling Validation for Robust Few-Shot Recognition
Hanxin Wang
Tian Liu
Shu Kong
VLM
449
1
0
05 Jun 2025
Light as Deception: GPT-driven Natural Relighting Against Vision-Language Pre-training Models
Ying Yang
Jie Zhang
Xiao Lv
Di Lin
Tao Xiang
Qing Guo
AAML
VLM
162
0
0
30 May 2025
Knowledge Insulating Vision-Language-Action Models: Train Fast, Run Fast, Generalize Better
Danny Driess
Jost Tobias Springenberg
Brian Ichter
Lili Yu
Adrian Li-Bell
...
Allen Z. Ren
Homer Walke
Quan Vuong
Lucy Xiaoyang Shi
Sergey Levine
294
46
0
29 May 2025
KerZOO: Kernel Function Informed Zeroth-Order Optimization for Accurate and Accelerated LLM Fine-Tuning
Zhendong Mi
Qitao Tan
Xiaodong Yu
Zining Zhu
Geng Yuan
Shaoyi Huang
356
4
0
24 May 2025
Analysing the Robustness of Vision-Language-Models to Common Corruptions
Muhammad Usama
Syeda Aishah Asim
Syed Bilal Ali
Syed Talal Wasim
Umair Bin Mansoor
VLM
342
3
0
18 Apr 2025
EarthGPT-X: A Spatial MLLM for Multi-level Multi-Source Remote Sensing Imagery Understanding with Visual Prompting
IEEE Transactions on Geoscience and Remote Sensing (IEEE TGRS), 2025
Wei Zhang
Miaoxin Cai
Yaqian Ning
Tianze Zhang
Yin Zhuang
He Chen
Jun Li
Xuerui Mao
402
0
0
17 Apr 2025
Video Summarization with Large Language Models
Computer Vision and Pattern Recognition (CVPR), 2025
Min Jung Lee
Dayoung Gong
Minsu Cho
265
7
0
15 Apr 2025
Group-based Distinctive Image Captioning with Memory Difference Encoding and Attention
International Journal of Computer Vision (IJCV), 2024
Jiuniu Wang
Wenjia Xu
Qingzhong Wang
Antoni B. Chan
370
2
0
03 Apr 2025
Semantic-Spatial Feature Fusion with Dynamic Graph Refinement for Remote Sensing Image Captioning
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (IEEE J-STARS), 2025
Maofu Liu
Jiahui Liu
Xiaokang Zhang
287
5
0
30 Mar 2025
CubeRobot: Grounding Language in Rubik's Cube Manipulation via Vision-Language Model
The Web Conference (WWW), 2025
Feiyang Wang
Xiaomin Yu
Wangyu Wu
LM&Ro
245
4
0
25 Mar 2025
Mind with Eyes: from Language Reasoning to Multimodal Reasoning
Zhiyu Lin
Yifei Gao
Xian Zhao
Yunfan Yang
Jitao Sang
LRM
320
16
0
23 Mar 2025
A Framework for a Capability-driven Evaluation of Scenario Understanding for Multimodal Large Language Models in Autonomous Driving
Tin Stribor Sohn
Philipp Reis
Maximilian Dillitzer
Johannes Bach
Jason J. Corso
Eric Sax
ELM
LRM
278
4
0
14 Mar 2025
A Frustratingly Simple Yet Highly Effective Attack Baseline: Over 90% Success Rate Against the Strong Black-box Models of GPT-4.5/4o/o1
Zhaoyi Li
Xiaohan Zhao
Dong-Dong Wu
Jiacheng Cui
Zhiqiang Shen
AAML
VLM
519
9
0
13 Mar 2025
CL-MoE: Enhancing Multimodal Large Language Model with Dual Momentum Mixture-of-Experts for Continual Visual Question Answering
Computer Vision and Pattern Recognition (CVPR), 2025
Tianyu Huai
Jie Zhou
Xingjiao Wu
Qin Chen
Qingchun Bai
Ze Zhou
Liang He
MoE
332
10
0
01 Mar 2025
Interaction2Code: Benchmarking MLLM-based Interactive Webpage Code Generation from Interactive Prototyping
Jingyu Xiao
Yuxuan Wan
Yintong Huo
Zihan Wang
Xinyi Xu
Wenxuan Wang
Zhiyao Xu
Longji Xu
Michael R. Lyu
348
12
0
21 Feb 2025
A Survey of Large Language Models for Healthcare: from Data, Technology, and Applications to Accountability and Ethics
Information Fusion (Inf. Fusion), 2023
Kai He
Rui Mao
Qika Lin
Yucheng Ruan
Xiang Lan
Mengling Feng
Xiaoshi Zhong
LM&MA
AILaw
726
269
0
28 Jan 2025
Patent Figure Classification using Large Vision-language Models
European Conference on Information Retrieval (ECIR), 2025
Sushil Awale
Eric Müller-Budack
Ralph Ewerth
210
1
0
22 Jan 2025
MuMu-LLaMA: Multi-modal Music Understanding and Generation via Large Language Models
Shansong Liu
Atin Sakkeer Hussain
Qilong Wu
Chenshuo Sun
Ying Shan
AuLLM
270
0
0
09 Dec 2024
Multimodal Fact-Checking with Vision Language Models: A Probing Classifier based Solution with Embedding Strategies
International Conference on Computational Linguistics (COLING), 2024
R. Çekinel
Pinar Karagoz
Cagri Coltekin
238
7
0
06 Dec 2024
HEIE: MLLM-Based Hierarchical Explainable AIGC Image Implausibility Evaluator
Computer Vision and Pattern Recognition (CVPR), 2024
Fan Yang
Ru Zhen
Jinqiao Wang
Yanhao Zhang
Haoxiang Chen
Haonan Lu
Sicheng Zhao
Guiguang Ding
453
10
0
26 Nov 2024
Chain of Attack: On the Robustness of Vision-Language Models Against Transfer-Based Adversarial Attacks
Computer Vision and Pattern Recognition (CVPR), 2024
Peng Xie
Yequan Bie
Jianda Mao
Yangqiu Song
Yang Wang
Hao Chen
Kani Chen
AAML
349
8
0
24 Nov 2024
No Culture Left Behind: ArtELingo-28, a Benchmark of WikiArt with Captions in 28 Languages
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Youssef Mohamed
Runjia Li
Ibrahim Said Ahmad
Kilichbek Haydarov
Juil Sock
Kenneth Church
Mohamed Elhoseiny
VLM
193
15
0
06 Nov 2024
SoK: Prompt Hacking of Large Language Models
BigData Congress [Services Society] (BSS), 2024
Baha Rababah
Shang Wu
Matthew Kwiatkowski
Carson Leung
Cuneyt Gurcan Akcora
AAML
170
6
0
16 Oct 2024
Removing Distributional Discrepancies in Captions Improves Image-Text Alignment
European Conference on Computer Vision (ECCV), 2024
Yuheng Li
Haotian Liu
Mu Cai
Yijun Li
Eli Shechtman
Zhe Lin
Yong Jae Lee
Krishna Kumar Singh
VLM
904
7
0
01 Oct 2024
HPT++: Hierarchically Prompting Vision-Language Models with Multi-Granularity Knowledge Generation and Improved Structure Modeling
Yubin Wang
Xinyang Jiang
De Cheng
Wenli Sun
Dongsheng Li
Cairong Zhao
VLM
228
1
0
27 Aug 2024
Surveying the Landscape of Image Captioning Evaluation: A Comprehensive Taxonomy, Trends and Metrics Analysis
Uri Berger
Gabriel Stanovsky
Omri Abend
Lea Frermann
438
0
0
09 Aug 2024
MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity
Yangzhou Liu
Yue Cao
Zhangwei Gao
Weiyun Wang
Zhe Chen
...
Lewei Lu
Xizhou Zhu
Tong Lu
Yu Qiao
Jifeng Dai
VLM
MLLM
311
41
0
22 Jul 2024
Continual Panoptic Perception: Towards Multi-modal Incremental Interpretation of Remote Sensing Images
Bo Yuan
Danpei Zhao
Zhuoran Liu
Wentao Li
Tian Li
CLL
VLM
391
4
0
19 Jul 2024
EarthMarker: Visual Prompt Learning for Region-level and Point-level Remote Sensing Imagery Comprehension
Wei Zhang
Miaoxin Cai
Tong Zhang
Jun Li
Zhuang Yin
Xuerui Mao
389
3
0
18 Jul 2024
Constructing Concept-based Models to Mitigate Spurious Correlations with Minimal Human Effort
Jeeyung Kim
Ze Wang
Qiang Qiu
236
6
0
12 Jul 2024
MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data
William Berman
A. Peysakhovich
280
5
0
26 Jun 2024
Automatically Generating UI Code from Screenshot: A Divide-and-Conquer-Based Approach
Yuxuan Wan
Chaozheng Wang
Yi Dong
Wenxuan Wang
Shuqing Li
Yintong Huo
Michael R. Lyu
3DV
641
29
0
24 Jun 2024