arXiv:1411.5726
CIDEr: Consensus-based Image Description Evaluation
Computer Vision and Pattern Recognition (CVPR), 2015
20 November 2014
Ramakrishna Vedantam
C. L. Zitnick
Devi Parikh
Papers citing
"CIDEr: Consensus-based Image Description Evaluation"
50 / 2,346 papers shown
Title
Q-Adapter: Visual Query Adapter for Extracting Textually-related Features in Video Captioning
Junan Chen
Trung Thanh Nguyen
Takahiro Komamizu
Ichiro Ide
28
0
0
11 Oct 2025
CapGeo: A Caption-Assisted Approach to Geometric Reasoning
Y. Li
Siyi Qian
Hao Liang
Leqi Zheng
Ruichuan An
Yongzhen Guo
Wentao Zhang
ReLM
LRM
88
0
0
10 Oct 2025
Metric Calculating Benchmark: Code-Verifiable Complicate Instruction Following Benchmark for Large Language Models
Hyeonseok Moon
Seongtae Hong
Jaehyung Seo
Heuiseok Lim
ALM
100
0
0
09 Oct 2025
Addressing the ID-Matching Challenge in Long Video Captioning
Zhantao Yang
Huangji Wang
Ruili Feng
Han Zhang
Yuting Hu
Shangwen Zhu
Junyan Li
Yu Liu
Fan Cheng
72
0
0
08 Oct 2025
Uncertainty in Machine Learning
Hans Weytjens
Wouter Verbeke
UD
193
0
0
07 Oct 2025
AURA Score: A Metric For Holistic Audio Question Answering Evaluation
Satvik Dixit
Soham Deshmukh
Bhiksha Raj
92
0
0
06 Oct 2025
Reward Models are Metrics in a Trench Coat
Sebastian Gehrmann
108
0
0
03 Oct 2025
One Patch to Caption Them All: A Unified Zero-Shot Captioning Framework
Lorenzo Bianchi
Giacomo Pacini
F. Carrara
Nicola Messina
Giuseppe Amato
Fabrizio Falchi
VLM
98
0
0
03 Oct 2025
MCM-DPO: Multifaceted Cross-Modal Direct Preference Optimization for Alt-text Generation
Jinlan Fu
Shenzhen Huangfu
Hao Fei
Yichong Huang
Xiaoyu Shen
Xipeng Qiu
See-Kiong Ng
57
0
0
01 Oct 2025
What You See is What You Ask: Evaluating Audio Descriptions
Divy Kala
Eshika Khandelwal
Makarand Tapaswi
DiffM
90
1
0
01 Oct 2025
VELA: An LLM-Hybrid-as-a-Judge Approach for Evaluating Long Image Captions
Kazuki Matsuda
Yuiga Wada
Shinnosuke Hirano
Seitaro Otsuki
Komei Sugiura
VLM
112
0
0
30 Sep 2025
FinCap: Topic-Aligned Captions for Short-Form Financial YouTube Videos
Siddhant Sukhani
Yash Bhardwaj
Riya Bhadani
Veer Kejriwal
Michael Galarnyk
Sudheer Chava
56
0
0
30 Sep 2025
When Audio Generators Become Good Listeners: Generative Features for Understanding Tasks
Zeyu Xie
Chenxing Li
Xuenan Xu
Mengyue Wu
Wenfu Wang
Ruibo Fu
Meng Yu
Dong Yu
Yuexian Zou
104
0
0
29 Sep 2025
Saliency Guided Longitudinal Medical Visual Question Answering
Jialin Wu
Xiaofeng Liu
MedIm
122
0
0
29 Sep 2025
Diff-3DCap: Shape Captioning with Diffusion Models
IEEE Transactions on Visualization and Computer Graphics (TVCG), 2025
Zhenyu Shu
Jiawei Wen
Shiyang Li
Shiqing Xin
Ligang Liu
DiffM
87
0
0
28 Sep 2025
RCI: A Score for Evaluating Global and Local Reasoning in Multimodal Benchmarks
Amit Agarwal
Hitesh Laxmichand Patel
Srikant Panda
Hansa Meghwani
Jyotika Singh
Karan Dua
Paul Li
Tao Sheng
Sujith Ravi
Dan Roth
LRM
90
3
0
28 Sep 2025
AttAnchor: Guiding Cross-Modal Token Alignment in VLMs with Attention Anchors
Junyang Zhang
Tianyi Zhu
Thierry Tambe
40
0
0
27 Sep 2025
Live-E2T: Real-time Threat Monitoring in Video via Deduplicated Event Reasoning and Chain-of-Thought
Yuhan Wang
Cheng Liu
Zihan Zhao
Weichao Wu
69
0
0
23 Sep 2025
Citrus-V: Advancing Medical Foundation Models with Unified Medical Image Grounding for Clinical Reasoning
Guoxin Wang
Jun Zhao
Xinyi Liu
Yanbo Liu
Xuyang Cao
...
Zhuoyun Liu
Qintian Sun
Fangru Zhou
Haoqiang Xing
Zhenhong Yang
LRM
126
1
0
23 Sep 2025
RadEval: A framework for radiology text evaluation
Justin Xu
Xi Zhang
Javid Abderezaei
Julie Bauml
Roger Boodoo
...
Eric Brattain
Dave Van Veen
Zaiqiao Meng
David Eyre
Jean-Benoit Delbrouck
LM&MA
116
1
0
22 Sep 2025
Text-Scene: A Scene-to-Language Parsing Framework for 3D Scene Understanding
Haoyuan Li
Rui Liu
Hehe Fan
Yi Yang
LM&Ro
62
0
0
20 Sep 2025
Advancing Reference-free Evaluation of Video Captions with Factual Analysis
Shubhashis Roy Dipta
Tz-Ying Wu
Subarna Tripathi
92
0
0
20 Sep 2025
RACap: Relation-Aware Prompting for Lightweight Retrieval-Augmented Image Captioning
Xiaosheng Long
Hanyu Wang
Zhentao Song
Kun Luo
Hongde Liu
84
0
0
19 Sep 2025
Spatial-CLAP: Learning Spatially-Aware Audio-Text Embeddings for Multi-Source Conditions
Kentaro Seki
Yuki Okamoto
Kouei Yamaoka
Yuki Saito
Shinnosuke Takamichi
Hiroshi Saruwatari
73
0
0
18 Sep 2025
Aligning Audio Captions with Human Preferences
Kartik Hegde
Rehana Mahfuz
Yinyi Guo
Erik M. Visser
64
0
0
18 Sep 2025
VisMoDAl: Visual Analytics for Evaluating and Improving Corruption Robustness of Vision-Language Models
Huanchen Wang
Wencheng Zhang
Zhiqiang Wang
Zhicong Lu
Yuxin Ma
91
0
0
18 Sep 2025
ResidualViT for Efficient Temporally Dense Video Encoding
Mattia Soldan
Fabian Caba Heilbron
Bernard Ghanem
Josef Sivic
Bryan C. Russell
137
0
0
16 Sep 2025
Evaluating Robustness of Vision-Language Models Under Noisy Conditions
Purushoth
Alireza
AAML
76
0
0
15 Sep 2025
Character-Centric Understanding of Animated Movies
Zhongrui Gui
Junyu Xie
Tengda Han
Weidi Xie
Andrew Zisserman
72
0
0
15 Sep 2025
Lost in Embeddings: Information Loss in Vision-Language Models
Wenyan Li
Raphael Tang
Chengzu Li
Caiqi Zhang
Ivan Vulić
Anders Søgaard
VLM
103
5
0
15 Sep 2025
Towards Understanding Visual Grounding in Visual Language Models
Georgios Pantazopoulos
Eda B. Özyiğit
ObjD
228
1
0
12 Sep 2025
Teaching AI Stepwise Diagnostic Reasoning with Report-Guided Chain-of-Thought Learning
Yihong Luo
Wenwu He
Zhuo-Xu Cui
Dong Liang
LM&MA
LRM
39
0
0
08 Sep 2025
SPECS: Specificity-Enhanced CLIP-Score for Long Image Caption Evaluation
Xiaofu Chen
Israfel Salazar
Yova Kementchedjhieva
156
1
0
04 Sep 2025
Sample-efficient Integration of New Modalities into Large Language Models
Osman Batur İnce
André F. T. Martins
Oisin Mac Aodha
Edoardo M. Ponti
MLLM
112
0
0
04 Sep 2025
Aesthetic Image Captioning with Saliency Enhanced MLLMs
Yilin Tao
Jiashui Huang
Huaze Xu
Ling Shao
229
0
0
04 Sep 2025
Sali4Vid: Saliency-Aware Video Reweighting and Adaptive Caption Retrieval for Dense Video Captioning
MinJu Jeon
Si-Woo Kim
Ye-Chan Kim
HyunGee Kim
Dong-Jin Kim
VGen
107
0
0
04 Sep 2025
Time-Scaling State-Space Models for Dense Video Captioning
A. Piergiovanni
Ganesh Mallya
Dahun Kim
A. Angelova
92
0
0
03 Sep 2025
ReCap: Event-Aware Image Captioning with Article Retrieval and Semantic Gaussian Normalization
T. Nguyen
Thanh-Tung Phan-Nguyen
Gia-Huy Dinh
Lam-Huy Nguyen
M. Tran
T. Le
60
0
0
01 Sep 2025
RT-VLM: Re-Thinking Vision Language Model with 4-Clues for Real-World Object Recognition Robustness
Junghyun Park
Tuan Anh Nguyen
Dugki Min
VLM
92
0
0
01 Sep 2025
SurgLLM: A Versatile Large Multimodal Model with Spatial Focus and Temporal Awareness for Surgical Video Understanding
Zhen Chen
Xingjian Luo
Kun Yuan
J. Wu
Danny Tat Ming Chan
Nassir Navab
Hongbin Liu
Zhen Lei
Jiebo Luo
156
2
0
30 Aug 2025
MM-SeR: Multimodal Self-Refinement for Lightweight Image Captioning
Junha Song
Yongsik Jo
So Yeon Min
Quanting Xie
Taehwan Kim
Yonatan Bisk
Jaegul Choo
VLM
132
0
0
29 Aug 2025
VoCap: Video Object Captioning and Segmentation from Any Prompt
J. Uijlings
Xingyi Zhou
Xiuye Gu
Arsha Nagrani
Anurag Arnab
Alireza Fathi
David A. Ross
Cordelia Schmid
VOS
VLM
184
1
0
29 Aug 2025
Event-Enriched Image Analysis Grand Challenge at ACM Multimedia 2025
T. Tran
Minh-Quang Nguyen
Minh-Triet Tran
Tam V. Nguyen
Trong-Le Do
Duy-Nam Ly
Viet-Tham Huynh
Khanh-Duy Le
Mai-Khiem Tran
Trung-Truc Huynh-Le
VGen
68
0
0
26 Aug 2025
Enhancing Model Privacy in Federated Learning with Random Masking and Quantization
Zhibo Xu
Jianhao Zhu
Jingwen Xu
Changze Lv
Zisu Huang
Xiaohua Wang
Muling Wu
Qi Qian
Xiaoqing Zheng
Xuanjing Huang
FedML
166
0
0
26 Aug 2025
From Global to Local: Social Bias Transfer in CLIP
Ryan Ramos
Yusuke Hirota
Yuta Nakashima
Noa Garcia
72
0
0
25 Aug 2025
GM-Skip: Metric-Guided Transformer Block Skipping for Efficient Vision-Language Models
Lianming Huang
Haibo Hu
Qiao Li
Xin He
Nan Guan
Chun Jason Xue
VLM
93
0
0
20 Aug 2025
Structured Prompting and Multi-Agent Knowledge Distillation for Traffic Video Interpretation and Risk Inference
Yunxiang Yang
Ningning Xu
Jidong J. Yang
72
0
0
19 Aug 2025
Region-Level Context-Aware Multimodal Understanding
Hongliang Wei
Xianqi Zhang
Xingtao Wang
Xiaopeng Fan
Debin Zhao
VLM
125
0
0
17 Aug 2025
Are Large Pre-trained Vision Language Models Effective Construction Safety Inspectors?
Xuezheng Chen
Zhengbo Zou
MLLM
68
0
0
14 Aug 2025
GoViG: Goal-Conditioned Visual Navigation Instruction Generation
Fengyi Wu
Yifei Dong
Zhi-Qi Cheng
Yilong Dai
Guangyu Chen
Hang Wang
Jingdong Sun
Alexander G. Hauptmann
96
2
0
13 Aug 2025