ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2004.00390
  4. Cited By
More Grounded Image Captioning by Distilling Image-Text Matching Model

More Grounded Image Captioning by Distilling Image-Text Matching Model

1 April 2020
Yuanen Zhou
Meng Wang
Daqing Liu
Zhenzhen Hu
Hanwang Zhang
ArXiv (abs)PDFHTMLGithub (63★)

Papers citing "More Grounded Image Captioning by Distilling Image-Text Matching Model"

46 / 46 papers shown
Title
Group-based Distinctive Image Captioning with Memory Difference Encoding and Attention
Group-based Distinctive Image Captioning with Memory Difference Encoding and Attention
Jiuniu Wang
Wenjia Xu
Qingzhong Wang
Antoni B. Chan
181
0
0
03 Apr 2025
Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
Hao Fei
Shengqiong Wu
Hao Zhang
Tat-Seng Chua
Shuicheng Yan
173
42
0
31 Dec 2024
Surveying the Landscape of Image Captioning Evaluation: A Comprehensive Taxonomy, Trends and Metrics Analysis
Surveying the Landscape of Image Captioning Evaluation: A Comprehensive Taxonomy, Trends and Metrics Analysis
Uri Berger
Gabriel Stanovsky
Omri Abend
Lea Frermann
71
0
0
09 Aug 2024
Causal Understanding For Video Question Answering
Causal Understanding For Video Question Answering
Bhanu Prakash Reddy Guda
Tanmay Kulkarni
Adithya Sampath
Swarnashree Mysore Sathyendra
CML
76
0
0
23 Jul 2024
Semi-Supervised Image Captioning Considering Wasserstein Graph Matching
Semi-Supervised Image Captioning Considering Wasserstein Graph Matching
Yang Yang
87
0
0
26 Mar 2024
ChatterBox: Multi-round Multimodal Referring and Grounding
ChatterBox: Multi-round Multimodal Referring and Grounding
Yunjie Tian
Tianren Ma
Lingxi Xie
Jihao Qiu
Xi Tang
Yuan Zhang
Jianbin Jiao
Qi Tian
Qixiang Ye
74
14
0
24 Jan 2024
Localized Symbolic Knowledge Distillation for Visual Commonsense Models
Localized Symbolic Knowledge Distillation for Visual Commonsense Models
Jinho Park
Jack Hessel
Khyathi Chandu
Paul Pu Liang
Ximing Lu
...
Youngjae Yu
Qiuyuan Huang
Jianfeng Gao
Ali Farhadi
Yejin Choi
VLM
77
12
0
08 Dec 2023
Ferret: Refer and Ground Anything Anywhere at Any Granularity
Ferret: Refer and Ground Anything Anywhere at Any Granularity
Haoxuan You
Haotian Zhang
Zhe Gan
Xianzhi Du
Bowen Zhang
Zirui Wang
Liangliang Cao
Shih-Fu Chang
Yinfei Yang
ObjDMLLMVLM
118
328
0
11 Oct 2023
A Joint Study of Phrase Grounding and Task Performance in Vision and
  Language Models
A Joint Study of Phrase Grounding and Task Performance in Vision and Language Models
Noriyuki Kojima
Hadar Averbuch-Elor
Yoav Artzi
61
2
0
06 Sep 2023
Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object
  Detection
Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection
Yifan Xu
Mengdan Zhang
Xiaoshan Yang
Changsheng Xu
ObjD
75
5
0
30 Aug 2023
Kefa: A Knowledge Enhanced and Fine-grained Aligned Speaker for
  Navigation Instruction Generation
Kefa: A Knowledge Enhanced and Fine-grained Aligned Speaker for Navigation Instruction Generation
Haitian Zeng
Xiaohan Wang
Wenguan Wang
Yi Yang
78
7
0
25 Jul 2023
Embedded Heterogeneous Attention Transformer for Cross-lingual Image
  Captioning
Embedded Heterogeneous Attention Transformer for Cross-lingual Image Captioning
Zijie Song
Zhenzhen Hu
Yuanen Zhou
Ye Zhao
Richang Hong
Meng Wang
46
3
0
19 Jul 2023
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
Ke Chen
Zhao Zhang
Weili Zeng
Richong Zhang
Feng Zhu
Rui Zhao
ObjD
121
652
0
27 Jun 2023
Top-Down Framework for Weakly-supervised Grounded Image Captioning
Top-Down Framework for Weakly-supervised Grounded Image Captioning
Chen Cai
Suchen Wang
Kim-Hui Yap
Yi Wang
ObjD
51
3
0
13 Jun 2023
METransformer: Radiology Report Generation by Transformer with Multiple
  Learnable Expert Tokens
METransformer: Radiology Report Generation by Transformer with Multiple Learnable Expert Tokens
Zhanyu Wang
Lingqiao Liu
Lei Wang
Luping Zhou
MedIm
72
76
0
05 Apr 2023
KENGIC: KEyword-driven and N-Gram Graph based Image Captioning
KENGIC: KEyword-driven and N-Gram Graph based Image Captioning
Brandon Birmingham
A. Muscat
49
1
0
07 Feb 2023
What You Say Is What You Show: Visual Narration Detection in
  Instructional Videos
What You Say Is What You Show: Visual Narration Detection in Instructional Videos
Kumar Ashutosh
Rohit Girdhar
Lorenzo Torresani
Kristen Grauman
103
4
0
05 Jan 2023
Do DALL-E and Flamingo Understand Each Other?
Do DALL-E and Flamingo Understand Each Other?
Hang Li
Jindong Gu
Rajat Koner
Sahand Sharifzadeh
Volker Tresp
MLLM
75
12
0
23 Dec 2022
Lesion Guided Explainable Few Weak-shot Medical Report Generation
Lesion Guided Explainable Few Weak-shot Medical Report Generation
Jinghan Sun
Dong Wei
Liansheng Wang
Yefeng Zheng
MedIm
115
13
0
16 Nov 2022
CAMANet: Class Activation Map Guided Attention Network for Radiology
  Report Generation
CAMANet: Class Activation Map Guided Attention Network for Radiology Report Generation
Jun Wang
A. Bhalerao
Terry Yin
Simon See
Yulan He
MedIm
78
18
0
02 Nov 2022
Word to Sentence Visual Semantic Similarity for Caption Generation:
  Lessons Learned
Word to Sentence Visual Semantic Similarity for Caption Generation: Lessons Learned
Ahmed Sabir
119
0
0
26 Sep 2022
ALADIN: Distilling Fine-grained Alignment Scores for Efficient
  Image-Text Matching and Retrieval
ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval
Nicola Messina
Matteo Stefanini
Marcella Cornia
Lorenzo Baraldi
Fabrizio Falchi
Giuseppe Amato
Rita Cucchiara
VLM
40
22
0
29 Jul 2022
Cross-modal Clinical Graph Transformer for Ophthalmic Report Generation
Cross-modal Clinical Graph Transformer for Ophthalmic Report Generation
Mingjie Li
Wenjia Cai
Karin Verspoor
Shirui Pan
Xiaodan Liang
Xiaojun Chang
MedIm
83
37
0
04 Jun 2022
Guiding Visual Question Answering with Attention Priors
Guiding Visual Question Answering with Attention Priors
T. Le
Vuong Le
Sunil R. Gupta
Svetha Venkatesh
T. Tran
61
6
0
25 May 2022
IDEAL: Query-Efficient Data-Free Learning from Black-box Models
IDEAL: Query-Efficient Data-Free Learning from Black-box Models
Jie M. Zhang
Chen Chen
Lingjuan Lyu
106
15
0
23 May 2022
Image Search with Text Feedback by Additive Attention Compositional
  Learning
Image Search with Text Feedback by Additive Attention Compositional Learning
Yuxin Tian
Shawn D. Newsam
K. Boakye
CoGe
65
13
0
08 Mar 2022
Cross-modal Contrastive Distillation for Instructional Activity
  Anticipation
Cross-modal Contrastive Distillation for Instructional Activity Anticipation
Zhengyuan Yang
Jingen Liu
Jing-ling Huang
Xiaodong He
Tao Mei
Chenliang Xu
Jiebo Luo
72
6
0
18 Jan 2022
Compact Bidirectional Transformer for Image Captioning
Compact Bidirectional Transformer for Image Captioning
Yuanen Zhou
Zhenzhen Hu
Daqing Liu
Huixia Ben
Meng Wang
VLM
62
16
0
06 Jan 2022
Neural Attention for Image Captioning: Review of Outstanding Methods
Neural Attention for Image Captioning: Review of Outstanding Methods
Zanyar Zohourianshahzadi
Jugal Kalita
VLM
86
47
0
29 Nov 2021
Less is More: Generating Grounded Navigation Instructions from Landmarks
Less is More: Generating Grounded Navigation Instructions from Landmarks
Su Wang
Ceslee Montgomery
Jordi Orbay
Vighnesh Birodkar
Aleksandra Faust
Izzeddin Gur
Natasha Jaques
Austin Waters
Jason Baldridge
Peter Anderson
135
64
0
25 Nov 2021
UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language
  Modeling
UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling
Zhengyuan Yang
Zhe Gan
Jianfeng Wang
Xiaowei Hu
Faisal Ahmed
Zicheng Liu
Yumao Lu
Lijuan Wang
135
117
0
23 Nov 2021
Exploiting Cross-Modal Prediction and Relation Consistency for
  Semi-Supervised Image Captioning
Exploiting Cross-Modal Prediction and Relation Consistency for Semi-Supervised Image Captioning
Yang Yang
Haoran Wei
Hengshu Zhu
Dianhai Yu
Hui Xiong
Jian Yang
SSL
27
33
0
22 Oct 2021
Topic Scene Graph Generation by Attention Distillation from Caption
Topic Scene Graph Generation by Attention Distillation from Caption
Wenbin Wang
R. Wang
X. Chen
DiffM
89
14
0
12 Oct 2021
Distributed Attention for Grounded Image Captioning
Distributed Attention for Grounded Image Captioning
Nenglun Chen
Xingjia Pan
Runnan Chen
Lei Yang
Zhiwen Lin
Yuqiang Ren
Haolei Yuan
Xiaowei Guo
Feiyue Huang
Wenping Wang
57
21
0
02 Aug 2021
iReason: Multimodal Commonsense Reasoning using Videos and Natural
  Language with Interpretability
iReason: Multimodal Commonsense Reasoning using Videos and Natural Language with Interpretability
Andrew Wang
Aman Chadha
CML
34
5
0
25 Jun 2021
Semi-Autoregressive Transformer for Image Captioning
Semi-Autoregressive Transformer for Image Captioning
Yuanen Zhou
Yong Zhang
Zhenzhen Hu
Meng Wang
VLM
78
25
0
17 Jun 2021
Revisiting Local Descriptor for Improved Few-Shot Classification
Revisiting Local Descriptor for Improved Few-Shot Classification
J. He
Richang Hong
Xueliang Liu
Mingliang Xu
Qianru Sun
84
28
0
30 Mar 2021
Iterative Shrinking for Referring Expression Grounding Using Deep
  Reinforcement Learning
Iterative Shrinking for Referring Expression Grounding Using Deep Reinforcement Learning
Mingjie Sun
Jimin Xiao
Eng Gee Lim
ObjD
79
35
0
09 Mar 2021
VisualGPT: Data-efficient Adaptation of Pretrained Language Models for
  Image Captioning
VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning
Jun Chen
Han Guo
Kai Yi
Boyang Albert Li
Mohamed Elhoseiny
VLM
144
227
0
20 Feb 2021
Macroscopic Control of Text Generation for Image Captioning
Macroscopic Control of Text Generation for Image Captioning
Zhangzi Zhu
Tianlei Wang
Hong Qu
79
4
0
20 Jan 2021
Improving Image Captioning by Leveraging Intra- and Inter-layer Global
  Representation in Transformer Network
Improving Image Captioning by Leveraging Intra- and Inter-layer Global Representation in Transformer Network
Jiayi Ji
Yunpeng Luo
Xiaoshuai Sun
Fuhai Chen
Gen Luo
Yongjian Wu
Yue Gao
Rongrong Ji
ViT
110
176
0
13 Dec 2020
iPerceive: Applying Common-Sense Reasoning to Multi-Modal Dense Video
  Captioning and Video Question Answering
iPerceive: Applying Common-Sense Reasoning to Multi-Modal Dense Video Captioning and Video Question Answering
Aman Chadha
Gurneet Arora
Navpreet Kaloty
66
37
0
16 Nov 2020
Multimodal Research in Vision and Language: A Review of Current and
  Emerging Trends
Multimodal Research in Vision and Language: A Review of Current and Emerging Trends
Shagun Uppal
Sarthak Bhagat
Devamanyu Hazarika
Navonil Majumdar
Soujanya Poria
Roger Zimmermann
Amir Zadeh
101
6
0
19 Oct 2020
VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning
VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning
Xiaowei Hu
Xi Yin
Kevin Qinghong Lin
Lijuan Wang
Lefei Zhang
Jianfeng Gao
Zicheng Liu
VLM
94
56
0
28 Sep 2020
Improving Weakly Supervised Visual Grounding by Contrastive Knowledge
  Distillation
Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation
Liwei Wang
Jing-ling Huang
Yin Li
Kun Xu
Zhengyuan Yang
Dong Yu
ObjD
74
84
0
03 Jul 2020
Deconfounded Image Captioning: A Causal Retrospect
Deconfounded Image Captioning: A Causal Retrospect
Xu Yang
Hanwang Zhang
Jianfei Cai
CML
72
126
0
09 Mar 2020
1