PromptCap: Prompt-Guided Task-Aware Image Captioning


15 November 2022
Yushi Hu
Hang Hua
Zhengyuan Yang
Weijia Shi
Noah A. Smith
Jiebo Luo
arXiv: 2211.09699

Papers citing "PromptCap: Prompt-Guided Task-Aware Image Captioning"

50 / 88 papers shown
Socratic Chart: Cooperating Multiple Agents for Robust SVG Chart Understanding
Yuyang Ji
Haohan Wang
LRM
26
0
0
14 Apr 2025
QIRL: Boosting Visual Question Answering via Optimized Question-Image Relation Learning
Quanxing Xu
Ling Zhou
X. Zhong
Feifei Zhang
Rubing Huang
Chia-Wen Lin
26
0
0
04 Apr 2025
Towards Top-Down Reasoning: An Explainable Multi-Agent Approach for Visual Question Answering
Zeqing Wang
Wentao Wan
Qiqing Lao
Runmeng Chen
Minjie Lang
Keze Wang
Liang Lin
LRM
92
3
0
17 Feb 2025
Prompt-Driven Continual Graph Learning
Qi Wang
Tianfei Zhou
Ye Yuan
Rui Mao
CLL
35
0
0
10 Feb 2025
DriveMLLM: A Benchmark for Spatial Understanding with Multimodal Large Language Models in Autonomous Driving
Xianda Guo
Ruijun Zhang
Yiqun Duan
Yuhang He
Chenming Zhang
Shuai Liu
Long Chen
LRM
61
11
0
20 Nov 2024
Difficult Task Yes but Simple Task No: Unveiling the Laziness in Multimodal LLMs
Sihang Zhao
Youliang Yuan
Xiaoying Tang
Pinjia He
14
2
0
15 Oct 2024
EAGLE: Egocentric AGgregated Language-video Engine
Jing Bi
Yunlong Tang
Luchuan Song
A. Vosoughi
Nguyen Nguyen
Chenliang Xu
25
8
0
26 Sep 2024
From Linguistic Giants to Sensory Maestros: A Survey on Cross-Modal Reasoning with Large Language Models
Shengsheng Qian
Zuyi Zhou
Dizhan Xue
Bing Wang
Changsheng Xu
LRM
29
1
0
19 Sep 2024
Knowledge Acquisition Disentanglement for Knowledge-based Visual Question Answering with Large Language Models
Wenbin An
Feng Tian
Jiahao Nie
Wenkai Shi
Haonan Lin
Yan Chen
Qianying Wang
Y. Wu
Guang Dai
Ping Chen
VLM
32
4
0
22 Jul 2024
EchoSight: Advancing Visual-Language Models with Wiki Knowledge
Yibin Yan
Weidi Xie
RALM
14
8
0
17 Jul 2024
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
Yushi Hu
Weijia Shi
Xingyu Fu
Dan Roth
Mari Ostendorf
Luke Zettlemoyer
Noah A. Smith
Ranjay Krishna
LRM
32
34
0
13 Jun 2024
Mixture of Rationale: Multi-Modal Reasoning Mixture for Visual Question Answering
Tao Li
Linjun Shou
Xuejun Liu
22
0
0
03 Jun 2024
Retrieval Meets Reasoning: Even High-school Textbook Knowledge Benefits Multimodal Reasoning
Cheng Tan
Jingxuan Wei
Linzhuang Sun
Zhangyang Gao
Siyuan Li
Bihui Yu
Ruifeng Guo
Stan Z. Li
ReLM
LRM
3DV
52
6
0
31 May 2024
Similarity is Not All You Need: Endowing Retrieval Augmented Generation with Multi Layered Thoughts
Chunjing Gan
Dan Yang
Binbin Hu
Hanxiao Zhang
Siyuan Li
...
Lin Ju
Zhiqiang Zhang
Jinjie Gu
Lei Liang
Jun Zhou
38
9
0
30 May 2024
PromptFix: You Prompt and We Fix the Photo
Yongsheng Yu
Ziyun Zeng
Hang Hua
Jianlong Fu
Jiebo Luo
MLLM
DiffM
VLM
33
3
0
27 May 2024
Detecting Multimodal Situations with Insufficient Context and Abstaining from Baseless Predictions
Junzhang Liu
Zhecan Wang
Hammad A. Ayyubi
Haoxuan You
Chris Thomas
Rui Sun
Shih-Fu Chang
Kai-Wei Chang
16
0
0
18 May 2024
Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation
Yunhao Ge
Xiaohui Zeng
Jacob Samuel Huffman
Tsung-Yi Lin
Ming-Yu Liu
Yin Cui
CoGe
DiffM
22
14
0
30 Apr 2024
Simplifying Multimodality: Unimodal Approach to Multimodal Challenges in Radiology with General-Domain Large Language Model
Seonhee Cho
Choonghan Kim
Jiho Lee
Chetan Chilkunda
Sujin Choi
Joo Heung Yoon
28
0
0
29 Apr 2024
FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction
Hang Hua
Jing Shi
Kushal Kafle
Simon Jenni
Daoan Zhang
John Collomosse
Scott D. Cohen
Jiebo Luo
CoGe
VLM
39
9
0
23 Apr 2024
Self-Bootstrapped Visual-Language Model for Knowledge Selection and Question Answering
Dongze Hao
Qunbo Wang
Longteng Guo
Jie Jiang
Jing Liu
14
0
0
22 Apr 2024
BLINK: Multimodal Large Language Models Can See but Not Perceive
Xingyu Fu
Yushi Hu
Bangzheng Li
Yu Feng
Haoyu Wang
Xudong Lin
Dan Roth
Noah A. Smith
Wei-Chiu Ma
Ranjay Krishna
VLM
LRM
MLLM
36
107
0
18 Apr 2024
V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning
Hang Hua
Yunlong Tang
Chenliang Xu
Jiebo Luo
VGen
52
22
0
18 Apr 2024
Beyond Embeddings: The Promise of Visual Table in Visual Reasoning
Yiwu Zhong
Zi-Yuan Hu
Michael R. Lyu
Liwei Wang
16
1
0
27 Mar 2024
AVicuna: Audio-Visual LLM with Interleaver and Context-Boundary Alignment for Temporal Referential Dialogue
Yunlong Tang
Daiki Shimada
Jing Bi
Chenliang Xu
VGen
14
17
0
24 Mar 2024
Knowledge Condensation and Reasoning for Knowledge-based VQA
Dongze Hao
Jian Jia
Longteng Guo
Qunbo Wang
Te Yang
...
Yanhua Cheng
Bo Wang
Quan Chen
Han Li
Jing Liu
21
0
0
15 Mar 2024
What Is Missing in Multilingual Visual Reasoning and How to Fix It
Yueqi Song
Simran Khanuja
Graham Neubig
VLM
LRM
76
6
0
03 Mar 2024
All in an Aggregated Image for In-Image Learning
Lei Wang
Wanyu Xu
Zhiqiang Hu
Yihuai Lan
Shan Dong
Hao Wang
Roy Ka-Wei Lee
Ee-Peng Lim
VLM
40
1
0
28 Feb 2024
CommVQA: Situating Visual Question Answering in Communicative Contexts
N. Naik
Christopher Potts
Elisa Kreiss
CoGe
19
0
0
22 Feb 2024
Cognitive Visual-Language Mapper: Advancing Multimodal Comprehension with Enhanced Visual Knowledge Alignment
Yunxin Li
Xinyu Chen
Baotian Hu
Haoyuan Shi
Min-Ling Zhang
29
0
0
21 Feb 2024
Modality-Aware Integration with Large Language Models for Knowledge-based Visual Question Answering
Junnan Dong
Qinggang Zhang
Huachi Zhou
Daochen Zha
Pai Zheng
Xiao Huang
22
8
0
20 Feb 2024
LangXAI: Integrating Large Vision Models for Generating Textual Explanations to Enhance Explainability in Visual Perception Tasks
Truong Thanh Hung Nguyen
Tobias Clement
Phuc Truong Loc Nguyen
Nils Kemmerzell
Van Binh Truong
V. Nguyen
Mohamed Abdelaal
Hung Cao
VLM
16
2
0
19 Feb 2024
Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion
Ziyue Wang
Chi Chen
Yiqi Zhu
Fuwen Luo
Peng Li
Ming Yan
Ji Zhang
Fei Huang
Maosong Sun
Yang Janet Liu
31
5
0
19 Feb 2024
Question-Instructed Visual Descriptions for Zero-Shot Video Question Answering
David Romero
Thamar Solorio
93
1
0
16 Feb 2024
Knowledge Generation for Zero-shot Knowledge-based VQA
Rui Cao
Jing Jiang
11
2
0
04 Feb 2024
Image-Text Out-Of-Context Detection Using Synthetic Multimodal Misinformation
Fatma Shalabi
H. Nguyen
Hichem Felouat
Ching-Chun Chang
Isao Echizen
19
5
0
29 Jan 2024
Q&A Prompts: Discovering Rich Visual Clues through Mining Question-Answer Prompts for VQA requiring Diverse World Knowledge
Haibi Wang
Weifeng Ge
LRM
4
3
0
19 Jan 2024
Cross-modal Retrieval for Knowledge-based Visual Question Answering
Paul Lerner
Olivier Ferret
C. Guinaudeau
25
7
0
11 Jan 2024
GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation
Tong Wu
Guandao Yang
Zhibing Li
Kai Zhang
Ziwei Liu
Leonidas J. Guibas
Dahua Lin
Gordon Wetzstein
EGVM
VGen
8
41
0
08 Jan 2024
CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs
Daoan Zhang
Junming Yang
Hanjia Lyu
Zijian Jin
Yuan Yao
Mingkai Chen
Jiebo Luo
16
33
0
05 Jan 2024
Video Understanding with Large Language Models: A Survey
Yunlong Tang
Jing Bi
Siting Xu
Luchuan Song
Susan Liang
...
Feng Zheng
Jianguo Zhang
Ping Luo
Jiebo Luo
Chenliang Xu
VLM
47
76
0
29 Dec 2023
Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models
Yushi Hu
Otilia Stretcu
Chun-Ta Lu
Krishnamurthy Viswanathan
Kenji Hata
Enming Luo
Ranjay Krishna
Ariel Fuxman
VLM
LRM
MLLM
27
26
0
05 Dec 2023
Compositional Zero-shot Learning via Progressive Language-based Observations
Lin Li
Guikun Chen
Jun Xiao
Long Chen
14
7
0
23 Nov 2023
Filling the Image Information Gap for VQA: Prompting Large Language Models to Proactively Ask Questions
Ziyue Wang
Chi Chen
Peng Li
Yang Janet Liu
LRM
8
14
0
20 Nov 2023
Is GPT Powerful Enough to Analyze the Emotions of Memes?
Jingjing Wang
Joshua Luo
Grace Yang
Allen Hong
Feng Luo
ELM
AI4MH
11
0
0
01 Nov 2023
Large Language Models are Visual Reasoning Coordinators
Liangyu Chen
Bo Li
Sheng Shen
Jingkang Yang
Chunyuan Li
Kurt Keutzer
Trevor Darrell
Ziwei Liu
VLM
LRM
18
46
0
23 Oct 2023
A Simple Baseline for Knowledge-Based Visual Question Answering
Alexandros Xenos
Themos Stafylakis
Ioannis Patras
Georgios Tzimiropoulos
61
7
0
20 Oct 2023
VidCoM: Fast Video Comprehension through Large Language Models with Multimodal Tools
Huihui Gong
Minjing Dong
Siqi Ma
S. Çamtepe
Chang Xu
Lei Hou
Surya Nepal
VLM
MLLM
37
0
0
16 Oct 2023
ViPE: Visualise Pretty-much Everything
Hassan Shahmohammadi
Adhiraj Ghosh
Hendrik P. A. Lensch
DiffM
17
1
0
16 Oct 2023
How (not) to ensemble LVLMs for VQA
Lisa Alazraki
Lluis Castrejon
Mostafa Dehghani
Fantine Huot
J. Uijlings
Thomas Mensink
14
3
0
10 Oct 2023
ViCor: Bridging Visual Understanding and Commonsense Reasoning with Large Language Models
KAI-QING Zhou
Kwonjoon Lee
Teruhisa Misu
Xin Eric Wang
LRM
16
3
0
09 Oct 2023