VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning

Computer Vision and Pattern Recognition (CVPR), 2021
20 February 2021
Jun Chen
Han Guo
Kai Yi
Boyang Albert Li
Mohamed Elhoseiny
    VLM

Papers citing "VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning"

50 / 165 papers shown
An Automated Survey of Generative Artificial Intelligence: Large Language Models, Architectures, Protocols, and Applications
Journal of Computer Science (JCS), 2023
Roberto Gozalo-Brizuela
Eduardo C. Garrido-Merchán
SyDa, LM&MA, ELM
440
142
0
10 Apr 2026
Leveraging Textual Compositional Reasoning for Robust Change Captioning
Kyu Ri Park
Jiyoung Park
Seong Tae Kim
Hong Joo Lee
Jung Uk Kim
CoGe
152
0
0
28 Nov 2025
Co-Training Vision Language Models for Remote Sensing Multi-task Learning
Qingyun Li
Shuran Ma
Junwei Luo
Yi Yu
Yue Zhou
...
Xudong Lu
Xiaoxing Wang
Xin He
Yushi Chen
Xue Yang
274
3
0
26 Nov 2025
Multimodal Continual Instruction Tuning with Dynamic Gradient Guidance
Songze Li
Mingyu Gao
Tonghua Su
Xu-Yao Zhang
Zhongjie Wang
CLL
402
2
0
19 Nov 2025
Towards Fast LLM Fine-tuning through Zeroth-Order Optimization with Projected Gradient-Aligned Perturbations
Zhendong Mi
Qitao Tan
Grace Li Zhang
Zhaozhuo Xu
Geng Yuan
Shaoyi Huang
167
1
0
21 Oct 2025
Graph4MM: Weaving Multimodal Learning with Structural Information
Xuying Ning
Dongqi Fu
Tianxin Wei
Wujiang Xu
Jingrui He
187
13
0
19 Oct 2025
QSVD: Efficient Low-rank Approximation for Unified Query-Key-Value Weight Compression in Low-Precision Vision-Language Models
Yutong Wang
Haiyu Wang
Sai Qian Zhang
133
3
0
18 Oct 2025
A Framework for Generating Artificial Datasets to Validate Absolute and Relative Position Concepts
George Correa de Araujo
H. Maia
Hélio Pedrini
196
0
0
17 Sep 2025
Bridging Vision Language Models and Symbolic Grounding for Video Question Answering
Haodi Ma
Vyom Pathak
Daisy Zhe Wang
177
2
0
15 Sep 2025
Galaxea Open-World Dataset and G0 Dual-System VLA Model
Tao Jiang
Tianyuan Yuan
Yicheng Liu
Chenhao Lu
Jianning Cui
Xiao Liu
Shuiqi Cheng
Jiyang Gao
Huazhe Xu
Hang Zhao
LM&Ro
169
46
0
30 Aug 2025
VT-LVLM-AR: A Video-Temporal Large Vision-Language Model Adapter for Fine-Grained Action Recognition in Long-Term Videos
Kaining Li
Shuwei He
Zihan Xu
VLM
131
1
0
21 Aug 2025
WeatherPrompt: Multi-modality Representation Learning for All-Weather Drone Visual Geo-Localization
Jiahao Wen
Hang Yu
Zhedong Zheng
415
4
0
13 Aug 2025
PET2Rep: Towards Vision-Language Model-Drived Automated Radiology Report Generation for Positron Emission Tomography
Yichi Zhang
Wenbo Zhang
Zehui Ling
Gang Feng
Sisi Peng
...
Limei Han
Yuan Cheng
Zixin Hu
Yuan Qi
Le Xue
MedIm, LM&MA
229
4
0
06 Aug 2025
VLM4D: Towards Spatiotemporal Awareness in Vision Language Models
Shijie Zhou
Alexander Vilesov
Xuehai He
Ziyu Wan
Shuwang Zhang
Aditya Nagachandra
Di Chang
DongDong Chen
Xin Eric Wang
A. Kadambi
VLM
297
0
0
04 Aug 2025
ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents
Yilei Jiang
Y. Zheng
Yuxuan Wan
Jiaming Han
Qunzhong Wang
Michael R. Lyu
Xiangyu Yue
LLMAG
245
14
0
30 Jul 2025
Group Relative Augmentation for Data Efficient Action Detection
Deep Patel
Iain Melvin
Zachary Izzo
Martin Renqiang Min
VLM
202
0
0
28 Jul 2025
ViGText: Deepfake Image Detection with Vision-Language Model Explanations and Graph Neural Networks
Ahmad ALBarqawi
Mahmoud Nazzal
Issa M. Khalil
Abdallah Khreishah
Nhathai Phan
316
1
0
24 Jul 2025
Med-GRIM: Enhanced Zero-Shot Medical VQA using prompt-embedded Multimodal Graph RAG
Rakesh Raj Madavan
Akshat Kaimal
Hashim Faisal
Chandrakala Shanmuganathan
MedIm
201
1
0
20 Jul 2025
ElasticMM: Efficient Multimodal LLMs Serving with Elastic Multimodal Parallelism
Zedong Liu
Shenggan Cheng
Guangming Tan
Yang You
Dingwen Tao
626
5
0
14 Jul 2025
Enabling Validation for Robust Few-Shot Recognition
Hanxin Wang
Tian Liu
Shu Kong
VLM
592
2
0
05 Jun 2025
Light as Deception: GPT-driven Natural Relighting Against Vision-Language Pre-training Models
Ying Yang
Jie Zhang
Xiao Lv
Di Lin
Tao Xiang
Qing Guo
AAML, VLM
215
1
0
30 May 2025
Knowledge Insulating Vision-Language-Action Models: Train Fast, Run Fast, Generalize Better
Danny Driess
Jost Tobias Springenberg
Brian Ichter
Lili Yu
Adrian Li-Bell
...
Allen Z. Ren
Homer Walke
Quan Vuong
Lucy Xiaoyang Shi
Sergey Levine
359
73
0
29 May 2025
KerZOO: Kernel Function Informed Zeroth-Order Optimization for Accurate and Accelerated LLM Fine-Tuning
Zhendong Mi
Qitao Tan
Xiaodong Yu
Zining Zhu
Geng Yuan
Shaoyi Huang
388
4
0
24 May 2025
Analysing the Robustness of Vision-Language-Models to Common Corruptions
Muhammad Usama
Syeda Aishah Asim
Syed Bilal Ali
Syed Talal Wasim
Umair Bin Mansoor
VLM
431
11
0
18 Apr 2025
EarthGPT-X: A Spatial MLLM for Multi-level Multi-Source Remote Sensing Imagery Understanding with Visual Prompting
IEEE Transactions on Geoscience and Remote Sensing (IEEE TGRS), 2025
Wei Zhang
Miaoxin Cai
Yaqian Ning
Tianze Zhang
Yin Zhuang
He Chen
Jun Li
Xuerui Mao
495
0
0
17 Apr 2025
Video Summarization with Large Language Models
Computer Vision and Pattern Recognition (CVPR), 2025
Min Jung Lee
Dayoung Gong
Minsu Cho
342
16
0
15 Apr 2025
Group-based Distinctive Image Captioning with Memory Difference Encoding and Attention
International Journal of Computer Vision (IJCV), 2024
Jiuniu Wang
Wenjia Xu
Qingzhong Wang
Antoni B. Chan
502
3
0
03 Apr 2025
Semantic-Spatial Feature Fusion with Dynamic Graph Refinement for Remote Sensing Image Captioning
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (IEEE J-STARS), 2025
Maofu Liu
Jiahui Liu
Xiaokang Zhang
354
5
0
30 Mar 2025
CubeRobot: Grounding Language in Rubik's Cube Manipulation via Vision-Language Model
The Web Conference (WWW), 2025
Feiyang Wang
Xiaomin Yu
Wangyu Wu
LM&Ro
307
6
0
25 Mar 2025
Mind with Eyes: from Language Reasoning to Multimodal Reasoning
Zhiyu Lin
Yifei Gao
Xian Zhao
Yunfan Yang
Jitao Sang
LRM
377
21
0
23 Mar 2025
A Framework for a Capability-driven Evaluation of Scenario Understanding for Multimodal Large Language Models in Autonomous Driving
Tin Stribor Sohn
Philipp Reis
Maximilian Dillitzer
Johannes Bach
Jason J. Corso
Eric Sax
ELM, LRM
362
6
0
14 Mar 2025
A Frustratingly Simple Yet Highly Effective Attack Baseline: Over 90% Success Rate Against the Strong Black-box Models of GPT-4.5/4o/o1
Zhaoyi Li
Xiaohan Zhao
Dong-Dong Wu
Jiacheng Cui
Zhiqiang Shen
AAML, VLM
655
24
0
13 Mar 2025
CL-MoE: Enhancing Multimodal Large Language Model with Dual Momentum Mixture-of-Experts for Continual Visual Question Answering
Computer Vision and Pattern Recognition (CVPR), 2025
Tianyu Huai
Jie Zhou
Xingjiao Wu
Qin Chen
Qingchun Bai
Ze Zhou
Liang He
MoE
364
16
0
01 Mar 2025
A Survey of Large Language Models for Healthcare: from Data, Technology, and Applications to Accountability and Ethics
Information Fusion (Inf. Fusion), 2023
Kai He
Rui Mao
Qika Lin
Yucheng Ruan
Xiang Lan
Mengling Feng
Xiaoshi Zhong
LM&MA, AILaw
906
302
0
28 Jan 2025
Patent Figure Classification using Large Vision-language Models
European Conference on Information Retrieval (ECIR), 2025
Sushil Awale
Eric Müller-Budack
Ralph Ewerth
241
1
0
22 Jan 2025
MuMu-LLaMA: Multi-modal Music Understanding and Generation via Large Language Models
Shansong Liu
Atin Sakkeer Hussain
Qilong Wu
Chenshuo Sun
Ying Shan
AuLLM
339
0
0
09 Dec 2024
Multimodal Fact-Checking with Vision Language Models: A Probing Classifier based Solution with Embedding Strategies
International Conference on Computational Linguistics (COLING), 2024
R. Çekinel
Pinar Karagoz
Cagri Coltekin
285
10
0
06 Dec 2024
HEIE: MLLM-Based Hierarchical Explainable AIGC Image Implausibility Evaluator
Computer Vision and Pattern Recognition (CVPR), 2024
Fan Yang
Ru Zhen
Jinqiao Wang
Yanhao Zhang
Haoxiang Chen
Haonan Lu
Sicheng Zhao
Guiguang Ding
603
11
0
26 Nov 2024
Chain of Attack: On the Robustness of Vision-Language Models Against Transfer-Based Adversarial Attacks
Computer Vision and Pattern Recognition (CVPR), 2024
Peng Xie
Yequan Bie
Jianda Mao
Yangqiu Song
Yang Wang
Hao Chen
Kani Chen
AAML
403
22
0
24 Nov 2024
No Culture Left Behind: ArtELingo-28, a Benchmark of WikiArt with Captions in 28 Languages
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Youssef Mohamed
Runjia Li
Ibrahim Said Ahmad
Kilichbek Haydarov
Juil Sock
Kenneth Church
Mohamed Elhoseiny
VLM
257
19
0
06 Nov 2024
Interaction2Code: Benchmarking MLLM-based Interactive Webpage Code Generation from Interactive Prototyping
Jingyu Xiao
Yuxuan Wan
Yintong Huo
Zihan Wang
Xinyi Xu
Wenxuan Wang
Zhiyao Xu
Longji Xu
Michael R. Lyu
435
1
0
05 Nov 2024
SoK: Prompt Hacking of Large Language Models
BigData Congress [Services Society] (BSS), 2024
Baha Rababah
Shang Wu
Matthew Kwiatkowski
Carson Leung
Cuneyt Gurcan Akcora
AAML
310
13
0
16 Oct 2024
Removing Distributional Discrepancies in Captions Improves Image-Text Alignment
European Conference on Computer Vision (ECCV), 2024
Yuheng Li
Haotian Liu
Mu Cai
Yijun Li
Eli Shechtman
Zhe Lin
Yong Jae Lee
Krishna Kumar Singh
VLM
957
8
0
01 Oct 2024
HPT++: Hierarchically Prompting Vision-Language Models with Multi-Granularity Knowledge Generation and Improved Structure Modeling
Yubin Wang
Xinyang Jiang
De Cheng
Wenli Sun
Dongsheng Li
Cairong Zhao
VLM
252
1
0
27 Aug 2024
Surveying the Landscape of Image Captioning Evaluation: A Comprehensive Taxonomy, Trends and Metrics Analysis
Uri Berger
Gabriel Stanovsky
Omri Abend
Lea Frermann
541
0
0
09 Aug 2024
MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity
Yangzhou Liu
Yue Cao
Zhangwei Gao
Weiyun Wang
Zhe Chen
...
Lewei Lu
Xizhou Zhu
Tong Lu
Yu Qiao
Jifeng Dai
VLM, MLLM
368
45
0
22 Jul 2024
Continual Panoptic Perception: Towards Multi-modal Incremental Interpretation of Remote Sensing Images
Bo Yuan
Danpei Zhao
Zhuoran Liu
Wentao Li
Tian Li
CLL, VLM
450
5
0
19 Jul 2024
EarthMarker: Visual Prompt Learning for Region-level and Point-level Remote Sensing Imagery Comprehension
Wei Zhang
Miaoxin Cai
Tong Zhang
Jun Li
Zhuang Yin
Xuerui Mao
461
3
0
18 Jul 2024
Constructing Concept-based Models to Mitigate Spurious Correlations with Minimal Human Effort
Jeeyung Kim
Ze Wang
Qiang Qiu
322
6
0
12 Jul 2024
MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data
William Berman
A. Peysakhovich
377
5
0
26 Jun 2024