1511.07571
DenseCap: Fully Convolutional Localization Networks for Dense Captioning
24 November 2015
Justin Johnson
A. Karpathy
Li Fei-Fei
VLM
Papers citing "DenseCap: Fully Convolutional Localization Networks for Dense Captioning"
50 / 468 papers shown
Title
Chunking Strategies for Multimodal AI Systems
Shashanka B R
Mohith Charan R
Seema Banu F
24
0
0
28 Nov 2025
Generating Accurate and Detailed Captions for High-Resolution Images
Hankyeol Lee
Gawon Seo
Kyounggyu Lee
Dogun Kim
Kyungwoo Song
Jiyoung Jung
MLLM
VLM
193
0
0
31 Oct 2025
Top-Down Semantic Refinement for Image Captioning
Jusheng Zhang
Kaitong Cai
Jing Yang
Jian Wang
Chengpei Tang
Keze Wang
DiffM
MLLM
BDL
270
6
0
25 Oct 2025
HouseTour: A Virtual Real Estate A(I)gent
Ata Çelen
Marc Pollefeys
Daniel Barath
Iro Armeni
VGen
205
1
0
20 Oct 2025
MaskCaptioner: Learning to Jointly Segment and Caption Object Trajectories in Videos
Gabriel Fiastre
Antoine Yang
Cordelia Schmid
VOS
385
0
0
16 Oct 2025
Image-to-Video Transfer Learning based on Image-Language Foundation Models: A Comprehensive Survey
Jinxuan Li
Chaolei Tan
Haoxuan Chen
Jianxin Ma
Jian-Fang Hu
Wei-Shi Zheng
Jianhuang Lai
VLM
129
1
0
12 Oct 2025
One Patch to Caption Them All: A Unified Zero-Shot Captioning Framework
Lorenzo Bianchi
Giacomo Pacini
F. Carrara
Nicola Messina
Giuseppe Amato
Fabrizio Falchi
VLM
142
0
0
03 Oct 2025
SPECS: Specificity-Enhanced CLIP-Score for Long Image Caption Evaluation
Xiaofu Chen
Israfel Salazar
Yova Kementchedjhieva
180
1
0
04 Sep 2025
VoCap: Video Object Captioning and Segmentation from Any Prompt
J. Uijlings
Xingyi Zhou
Xiuye Gu
Arsha Nagrani
Anurag Arnab
Alireza Fathi
David A. Ross
Cordelia Schmid
VOS
VLM
232
1
0
29 Aug 2025
Can Mental Imagery Improve the Thinking Capabilities of AI Systems?
Slimane Larabi
LRM
158
0
0
16 Jul 2025
SingaKids: A Multilingual Multimodal Dialogic Tutor for Language Learning
Zhengyuan Liu
Geyu Lin
Hui Li Tan
Huayun Zhang
Yanfeng Lu
...
Stella Xin Yin
He Sun
Hock Huan Goh
Lung Hsiang Wong
Nancy F. Chen
167
3
0
03 Jun 2025
Document-Level Text Generation with Minimum Bayes Risk Decoding using Optimal Transport
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Yuu Jinnai
OT
171
0
0
29 May 2025
Panoptic Captioning: An Equivalence Bridge for Image and Text
Kun-Yu Lin
Hongjun Wang
Weining Ren
Kai Han
607
0
0
22 May 2025
Describe Anything in Medical Images
Xi Xiao
Yunbei Zhang
Thanh-Huy Nguyen
Ba Thinh Lam
Janet Wang
...
Xiaobei Wang
Xiao Wang
Hao Xu
Tianming Liu
Min Xu
MedIm
VLM
538
11
0
09 May 2025
Survey of Abstract Meaning Representation: Then, Now, Future
Behrooz Mansouri
3DV
884
2
0
06 May 2025
Using Vision Language Models for Safety Hazard Identification in Construction
Muhammad Adil
Gaang Lee
Vicente A. Gonzalez
Qipei Mei
273
7
0
12 Apr 2025
URECA: Unique Region Caption Anything
Sangbeom Lim
J. Kim
Heeji Yoon
Jaewoo Jung
Seungryong Kim
264
1
0
07 Apr 2025
ExCap3D: Expressive 3D Scene Understanding via Object Captioning with Varying Detail
Chandan Yeshwanth
Dávid Rozenberszki
Angela Dai
260
3
0
21 Mar 2025
RTGen: Real-Time Generative Detection Transformer
Chi Ruan
Jiying Zhao
Wenhu Chen
ObjD
VLM
368
0
0
28 Feb 2025
Robin3D: Improving 3D Large Language Model via Robust Instruction Tuning
Weitai Kang
Haifeng Huang
Yuzhang Shang
Mubarak Shah
Yan Yan
319
18
0
21 Feb 2025
Benchmarking Large and Small MLLMs
Xuelu Feng
Yunsheng Li
DongDong Chen
Mei Gao
Mengchen Liu
Junsong Yuan
Chunming Qiao
119
3
0
04 Jan 2025
PunchBench: Benchmarking MLLMs in Multimodal Punchline Comprehension
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Kun Ouyang
Yuanxin Liu
Shicheng Li
Yi Liu
Hao Zhou
Fandong Meng
Jie Zhou
Xu Sun
339
1
0
16 Dec 2024
Detailed Object Description with Controllable Dimensions
IEEE Transactions on Multimedia (IEEE TMM), 2024
Xinran Wang
Hao Zhang
Baoteng Li
Kongming Liang
Hao Sun
Zhongjiang He
Tianhao Shen
Jun Guo
293
1
0
28 Nov 2024
CapHDR2IR: Caption-Driven Transfer from Visible Light to Infrared Domain
Jingchao Peng
Thomas Bashford-Rogers
Zhuang Shao
Haitao Zhao
Aru Ranjan Singh
Abhishek Goswami
Kurt Debattista
210
0
0
25 Nov 2024
FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity
Computer Vision and Pattern Recognition (CVPR), 2024
Hang Hua
Qing Liu
Lingzhi Zhang
Jing Shi
Zhifei Zhang
Yilin Wang
Jianming Zhang
Jiebo Luo
CoGe
VLM
288
17
0
23 Nov 2024
ComiCap: A VLMs pipeline for dense captioning of Comic Panels
Emanuele Vivoli
Niccoló Biondi
Marco Bertini
Dimosthenis Karatzas
193
4
0
24 Sep 2024
TheraGen: Therapy for Every Generation
Kartikey Doshi
Jimit Shah
Narendra Shekokar
AI4MH
149
0
0
12 Sep 2024
Question-Answering Dense Video Events
Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2024
Hangyu Qin
Junbin Xiao
Angela Yao
VLM
465
6
0
06 Sep 2024
TRRG: Towards Truthful Radiology Report Generation With Cross-modal Disease Clue Enhanced Large Language Model
Yuhao Wang
Chao Hao
Yawen Cui
Xinqi Su
Weicheng Xie
Tao Tan
Zitong Yu
LM&MA
MedIm
174
1
0
22 Aug 2024
ProgramAlly: Creating Custom Visual Access Programs via Multi-Modal End-User Programming
ACM Symposium on User Interface Software and Technology (UIST), 2024
Jaylin Herskovitz
Andi Xu
Rahaf Alharbi
Anhong Guo
108
5
0
20 Aug 2024
Surveying the Landscape of Image Captioning Evaluation: A Comprehensive Taxonomy, Trends and Metrics Analysis
Uri Berger
Gabriel Stanovsky
Omri Abend
Lea Frermann
340
0
0
09 Aug 2024
COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark
European Conference on Computer Vision (ECCV), 2024
Koki Maeda
Tosho Hirasawa
Atsushi Hashimoto
Jun Harashima
Leszek Rybicki
Yusuke Fukasawa
Yoshitaka Ushiku
251
3
0
05 Aug 2024
Can Textual Semantics Mitigate Sounding Object Segmentation Preference?
Yaoting Wang
Peiwen Sun
Yuanchao Li
Honggang Zhang
Di Hu
272
12
0
15 Jul 2024
Emergent Visual-Semantic Hierarchies in Image-Text Representations
Morris Alper
Hadar Averbuch-Elor
VLM
362
15
0
11 Jul 2024
Certainly Uncertain: A Benchmark and Metric for Multimodal Epistemic and Aleatoric Awareness
Khyathi Chandu
Linjie Li
Anas Awadalla
Ximing Lu
Jae Sung Park
Jack Hessel
Lijuan Wang
Yejin Choi
293
6
0
02 Jul 2024
Details Make a Difference: Object State-Sensitive Neurorobotic Task Planning
International Conference on Artificial Neural Networks (ICANN), 2024
Xiaowen Sun
Xufeng Zhao
Jae Hee Lee
Wenhao Lu
Matthias Kerzel
Stefan Wermter
LM&Ro
205
4
0
14 Jun 2024
Image Textualization: An Automatic Framework for Creating Accurate and Detailed Image Descriptions
Renjie Pi
Jianshu Zhang
Jipeng Zhang
Boyao Wang
Zhekai Chen
Tong Zhang
3DV
190
32
0
11 Jun 2024
DynRefer: Delving into Region-level Multimodal Tasks via Dynamic Resolution
Yuzhong Zhao
Feng Liu
Yue Liu
Mingxiang Liao
Chen Gong
QiXiang Ye
Fang Wan
ObjD
159
0
0
25 May 2024
LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation?
Yuchi Wang
Shuhuai Ren
Rundong Gao
Linli Yao
Qingyan Guo
Kaikai An
Jianhong Bai
Xu Sun
DiffM
VLM
236
14
0
16 Apr 2024
DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection
Lewei Yao
Renjie Pi
Jianhua Han
Xiaodan Liang
Hang Xu
Wei Zhang
Zhenguo Li
Dan Xu
VLM
ObjD
240
43
0
14 Apr 2024
LocCa: Visual Pretraining with Location-aware Captioners
Bo Wan
Michael Tschannen
Yongqin Xian
Filip Pavetić
Ibrahim Alabdulmohsin
Xiao Wang
André Susano Pinto
Andreas Steiner
Lucas Beyer
Xiao-Qi Zhai
VLM
340
20
0
28 Mar 2024
Entity6K: A Large Open-Domain Evaluation Dataset for Real-World Entity Recognition
Jielin Qiu
William Jongwon Han
Winfred Wang
Zhengyuan Yang
Linjie Li
Jianfeng Wang
Christos Faloutsos
Lei Li
Lijuan Wang
VLM
237
3
0
19 Mar 2024
Generative Region-Language Pretraining for Open-Ended Object Detection
Computer Vision and Pattern Recognition (CVPR), 2024
Chuang Lin
Yi Jiang
Zhuang Li
Zehuan Yuan
Jianfei Cai
ObjD
VLM
190
27
0
15 Mar 2024
TutoAI: A Cross-domain Framework for AI-assisted Mixed-media Tutorial Creation on Physical Tasks
International Conference on Human Factors in Computing Systems (CHI), 2024
Yuexi Chen
Vlad I. Morariu
Anh Truong
Zhicheng Liu
DiffM
VGen
218
9
0
12 Mar 2024
A Comprehensive Survey of 3D Dense Captioning: Localizing and Describing Objects in 3D Scenes
Ting Yu
Xiaojun Lin
Shuhui Wang
Weiguo Sheng
Qingming Huang
Jun-chen Yu
3DV
208
16
0
12 Mar 2024
AICAttack: Adversarial Image Captioning Attack with Attention-Based Optimization
Jiyao Li
Mingze Ni
Yifei Dong
Tianqing Zhu
Wei Liu
AAML
178
4
0
19 Feb 2024
FuseMoE: Mixture-of-Experts Transformers for Fleximodal Fusion
Xing Han
Huy Nguyen
Carl Harris
Nhat Ho
Suchi Saria
MoE
360
45
0
05 Feb 2024
ControlCap: Controllable Region-level Captioning
Yuzhong Zhao
Yue Liu
Zonghao Guo
Weijia Wu
Chen Gong
Fang Wan
QiXiang Ye
356
14
0
31 Jan 2024
Towards Unified Interactive Visual Grounding in The Wild
Jie Xu
Hanbo Zhang
Qingyi Si
Yifeng Li
Xuguang Lan
Tao Kong
LM&Ro
262
5
0
30 Jan 2024
Enhancing Multimodal Understanding with CLIP-Based Image-to-Text Transformation
Change Che
Qunwei Lin
Xinyu Zhao
Jiaxin Huang
Liqiang Yu
VLM
135
50
0
02 Jan 2024