Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
1511.07571
Cited By
DenseCap: Fully Convolutional Localization Networks for Dense Captioning
24 November 2015
Justin Johnson
A. Karpathy
Li Fei-Fei
VLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"DenseCap: Fully Convolutional Localization Networks for Dense Captioning"
50 / 452 papers shown
Title
Describe Anything in Medical Images
Xi Xiao
Yunbei Zhang
Thanh-Huy Nguyen
Ba Thinh Lam
Janet Wang
...
Xingjian Li
X. U. Wang
Hao Xu
Tianming Liu
Min Xu
MedIm
VLM
46
0
0
09 May 2025
Survey of Abstract Meaning Representation: Then, Now, Future
Behrooz Mansouri
3DV
110
0
0
06 May 2025
Using Vision Language Models for Safety Hazard Identification in Construction
Muhammad Adil
Gaang Lee
Vicente A. Gonzalez
Qipei Mei
36
1
0
12 Apr 2025
URECA: Unique Region Caption Anything
Sangbeom Lim
J. Kim
Heeji Yoon
Jaewoo Jung
Seungryong Kim
29
0
0
07 Apr 2025
ExCap3D: Expressive 3D Scene Understanding via Object Captioning with Varying Detail
Chandan Yeshwanth
Dávid Rozenberszki
Angela Dai
77
0
0
21 Mar 2025
RTGen: Real-Time Generative Detection Transformer
Chi Ruan
ObjD
VLM
47
0
0
28 Feb 2025
Robin3D: Improving 3D Large Language Model via Robust Instruction Tuning
Weitai Kang
Haifeng Huang
Yuzhang Shang
Mubarak Shah
Yan Yan
46
7
0
21 Feb 2025
PunchBench: Benchmarking MLLMs in Multimodal Punchline Comprehension
Kun Ouyang
Yuanxin Liu
Shicheng Li
Yi Liu
Hao Zhou
Fandong Meng
Jie Zhou
Xu Sun
102
1
0
16 Dec 2024
CapHDR2IR: Caption-Driven Transfer from Visible Light to Infrared Domain
Jingchao Peng
Thomas Bashford-Rogers
Zhuang Shao
Haitao Zhao
Aru Ranjan Singh
Abhishek Goswami
Kurt Debattista
64
0
0
25 Nov 2024
FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity
Hang Hua
Qing Liu
Lingzhi Zhang
Jing Shi
Zhifei Zhang
Yilin Wang
Jianming Zhang
Jiebo Luo
CoGe
VLM
90
6
0
23 Nov 2024
ComiCap: A VLMs pipeline for dense captioning of Comic Panels
Emanuele Vivoli
Niccoló Biondi
Marco Bertini
Dimosthenis Karatzas
38
3
0
24 Sep 2024
TheraGen: Therapy for Every Generation
Kartikey Doshi
Jimit Shah
Narendra Shekokar
AI4MH
18
0
0
12 Sep 2024
Question-Answering Dense Video Events
Hangyu Qin
Junbin Xiao
Angela Yao
VLM
71
1
0
06 Sep 2024
TRRG: Towards Truthful Radiology Report Generation With Cross-modal Disease Clue Enhanced Large Language Model
Yuhao Wang
Chao Hao
Yawen Cui
Xinqi Su
Weicheng Xie
Tao Tan
Zitong Yu
LM&MA
MedIm
28
0
0
22 Aug 2024
ProgramAlly: Creating Custom Visual Access Programs via Multi-Modal End-User Programming
Jaylin Herskovitz
Andi Xu
Rahaf Alharbi
Anhong Guo
18
2
0
20 Aug 2024
Surveying the Landscape of Image Captioning Evaluation: A Comprehensive Taxonomy, Trends and Metrics Analysis
Uri Berger
Gabriel Stanovsky
Omri Abend
Lea Frermann
27
0
0
09 Aug 2024
COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark
Koki Maeda
Tosho Hirasawa
Atsushi Hashimoto
Jun Harashima
Leszek Rybicki
Yusuke Fukasawa
Yoshitaka Ushiku
43
0
0
05 Aug 2024
Can Textual Semantics Mitigate Sounding Object Segmentation Preference?
Yaoting Wang
Peiwen Sun
Yuanchao Li
Honggang Zhang
Di Hu
38
5
0
15 Jul 2024
Emergent Visual-Semantic Hierarchies in Image-Text Representations
Morris Alper
Hadar Averbuch-Elor
VLM
32
6
0
11 Jul 2024
Certainly Uncertain: A Benchmark and Metric for Multimodal Epistemic and Aleatoric Awareness
Khyathi Raghavi Chandu
Linjie Li
Anas Awadalla
Ximing Lu
Jae Sung Park
Jack Hessel
Lijuan Wang
Yejin Choi
41
2
0
02 Jul 2024
Details Make a Difference: Object State-Sensitive Neurorobotic Task Planning
Xiaowen Sun
Xufeng Zhao
Jae Hee Lee
Wenhao Lu
Matthias Kerzel
Stefan Wermter
LM&Ro
34
2
0
14 Jun 2024
Image Textualization: An Automatic Framework for Creating Accurate and Detailed Image Descriptions
Renjie Pi
Jianshu Zhang
Jipeng Zhang
Rui Pan
Zhekai Chen
Tong Zhang
3DV
42
19
0
11 Jun 2024
LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation?
Yuchi Wang
Shuhuai Ren
Rundong Gao
Linli Yao
Qingyan Guo
Kaikai An
Jianhong Bai
Xu Sun
DiffM
VLM
36
6
0
16 Apr 2024
DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection
Lewei Yao
Renjie Pi
Jianhua Han
Xiaodan Liang
Hang Xu
Wei Zhang
Zhenguo Li
Dan Xu
VLM
ObjD
45
20
0
14 Apr 2024
LocCa: Visual Pretraining with Location-aware Captioners
Bo Wan
Michael Tschannen
Yongqin Xian
Filip Pavetić
Ibrahim M. Alabdulmohsin
Xiao Wang
André Susano Pinto
Andreas Steiner
Lucas Beyer
Xiao-Qi Zhai
VLM
40
6
0
28 Mar 2024
Entity6K: A Large Open-Domain Evaluation Dataset for Real-World Entity Recognition
Jielin Qiu
William Jongwon Han
Winfred Wang
Zhengyuan Yang
Linjie Li
Jianfeng Wang
Christos Faloutsos
Lei Li
Lijuan Wang
VLM
56
2
0
19 Mar 2024
Generative Region-Language Pretraining for Open-Ended Object Detection
Chuang Lin
Yi-Xin Jiang
Lizhen Qu
Zehuan Yuan
Jianfei Cai
ObjD
VLM
46
13
0
15 Mar 2024
TutoAI: A Cross-domain Framework for AI-assisted Mixed-media Tutorial Creation on Physical Tasks
Yuexi Chen
Vlad I. Morariu
Anh Truong
Zhicheng Liu
DiffM
VGen
31
3
0
12 Mar 2024
A Comprehensive Survey of 3D Dense Captioning: Localizing and Describing Objects in 3D Scenes
Ting Yu
Xiaojun Lin
Shuhui Wang
Weiguo Sheng
Qingming Huang
Jun-chen Yu
3DV
37
10
0
12 Mar 2024
AICAttack: Adversarial Image Captioning Attack with Attention-Based Optimization
Jiyao Li
Mingze Ni
Yifei Dong
Tianqing Zhu
Wei Liu
AAML
27
2
0
19 Feb 2024
FuseMoE: Mixture-of-Experts Transformers for Fleximodal Fusion
Xing Han
Huy Nguyen
Carl Harris
Nhat Ho
S. Saria
MoE
69
16
0
05 Feb 2024
ControlCap: Controllable Region-level Captioning
Yuzhong Zhao
Yue Liu
Zonghao Guo
Weijia Wu
Chen Gong
Fang Wan
QiXiang Ye
24
5
0
31 Jan 2024
Towards Unified Interactive Visual Grounding in The Wild
Jie Xu
Hanbo Zhang
Qingyi Si
Yifeng Li
Xuguang Lan
Tao Kong
LM&Ro
30
5
0
30 Jan 2024
Enhancing Multimodal Understanding with CLIP-Based Image-to-Text Transformation
Change Che
Qunwei Lin
Xinyu Zhao
Jiaxin Huang
Liqiang Yu
VLM
15
37
0
02 Jan 2024
Cycle-Consistency Learning for Captioning and Grounding
Ning Wang
Jiajun Deng
Mingbo Jia
ObjD
23
7
0
23 Dec 2023
Multi-Clue Reasoning with Memory Augmentation for Knowledge-based Visual Question Answering
Chengxiang Yin
Zhengping Che
Kun Wu
Zhiyuan Xu
Jian Tang
15
0
0
20 Dec 2023
Pixel Aligned Language Models
Jiarui Xu
Xingyi Zhou
Shen Yan
Xiuye Gu
Anurag Arnab
Chen Sun
Xiaolong Wang
Cordelia Schmid
MLLM
VLM
43
14
0
14 Dec 2023
Localized Symbolic Knowledge Distillation for Visual Commonsense Models
J. Park
Jack Hessel
Khyathi Raghavi Chandu
Paul Pu Liang
Ximing Lu
...
Youngjae Yu
Qiuyuan Huang
Jianfeng Gao
Ali Farhadi
Yejin Choi
VLM
19
11
0
08 Dec 2023
Towards More Unified In-context Visual Understanding
Dianmo Sheng
Dongdong Chen
Zhentao Tan
Qiankun Liu
Qi Chu
Jianmin Bao
Tao Gong
Bin Liu
Shengwei Xu
Nenghai Yu
MLLM
VLM
24
10
0
05 Dec 2023
Object Recognition as Next Token Prediction
Kaiyu Yue
Borchun Chen
Jonas Geiping
Hengduo Li
Tom Goldstein
Ser-Nam Lim
22
8
0
04 Dec 2023
Segment and Caption Anything
Xiaoke Huang
Jianfeng Wang
Yansong Tang
Zheng Zhang
Han Hu
Jiwen Lu
Lijuan Wang
Zicheng Liu
MLLM
VLM
26
18
0
01 Dec 2023
Contrastive Vision-Language Alignment Makes Efficient Instruction Learner
Lizhao Liu
Xinyu Sun
Tianhang Xiang
Zhuangwei Zhuang
Liuren Yin
Mingkui Tan
VLM
19
2
0
29 Nov 2023
GOAT: GO to Any Thing
Matthew Chang
Théophile Gervet
Mukul Khanna
Sriram Yenamandra
Dhruv Shah
...
Saurabh Gupta
Dhruv Batra
Roozbeh Mottaghi
Jitendra Malik
Devendra Singh Chaplot
26
65
0
10 Nov 2023
FAITHSCORE: Evaluating Hallucinations in Large Vision-Language Models
Liqiang Jing
Ruosen Li
Yunmo Chen
Mengzhao Jia
Xinya Du
MLLM
16
6
0
02 Nov 2023
Generating Context-Aware Natural Answers for Questions in 3D Scenes
Mohammed Munzer Dwedari
Matthias Niessner
Dave Zhenyu Chen
22
1
0
30 Oct 2023
Open Visual Knowledge Extraction via Relation-Oriented Multimodality Model Prompting
Hejie Cui
Xinyu Fang
Zihan Zhang
Ran Xu
Xuan Kan
Xin Liu
Yue Yu
Manling Li
Yangqiu Song
Carl Yang
VLM
15
4
0
28 Oct 2023
InViG: Benchmarking Interactive Visual Grounding with 500K Human-Robot Interactions
Hanbo Zhang
Jie Xu
Yuchen Mo
Tao Kong
17
1
0
18 Oct 2023
Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval
Hao Li
Marie-Jeanne Lesot
Lianli Gao
Xiaosu Zhu
Christophe Marsala
EDL
14
11
0
29 Sep 2023
Rank2Tell: A Multimodal Driving Dataset for Joint Importance Ranking and Reasoning
Enna Sachdeva
Nakul Agarwal
Suhas Chundi
Sean Roelofs
Jiachen Li
Mykel Kochenderfer
Chiho Choi
Behzad Dariush
25
47
0
12 Sep 2023
Towards Real Time Egocentric Segment Captioning for The Blind and Visually Impaired in RGB-D Theatre Images
Khadidja Delloul
S. Larabi
18
2
0
26 Aug 2023
1
2
3
4
...
8
9
10
Next