1411.5726
Cited By
v1
v2 (latest)
CIDEr: Consensus-based Image Description Evaluation
Computer Vision and Pattern Recognition (CVPR), 2014
20 November 2014
Ramakrishna Vedantam
C. L. Zitnick
Devi Parikh
Papers citing
"CIDEr: Consensus-based Image Description Evaluation"
50 / 2,353 papers shown
VLM-Assisted Continual learning for Visual Question Answering in Self-Driving
Yuxin Lin
Mengshi Qi
Liang Liu
Huadong Ma
CLL
291
4
0
02 Feb 2025
Mobile Manipulation Instruction Generation from Multiple Images with Automatic Metric Enhancement
IEEE Robotics and Automation Letters (IEEE RA-L), 2025
Kei Katsumata
Motonari Kambara
Daichi Yashima
Ryosuke Korekata
Komei Sugiura
420
0
0
28 Jan 2025
An Ensemble Model with Attention Based Mechanism for Image Captioning
Computers & electrical engineering (Comput. Electr. Eng.), 2025
Israa Al Badarneh
Bassam Hammo
Omar Al-Kadi
369
14
0
28 Jan 2025
Advancing General Multimodal Capability of Vision-language Models with Pyramid-descent Visual Position Encoding
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Ziyang Chen
Mingxiao Li
Zhongfu Chen
Nan Du
Xiaolong Li
Yuexian Zou
365
3
0
19 Jan 2025
DriveLM: Driving with Graph Visual Question Answering
European Conference on Computer Vision (ECCV), 2023
Chonghao Sima
Katrin Renz
Kashyap Chitta
Lawrence Yunliang Chen
Hanxue Zhang
Chengen Xie
Jens Beißwenger
Ping Luo
Andreas Geiger
Hongyang Li
802
355
0
17 Jan 2025
3UR-LLM: An End-to-End Multimodal Large Language Model for 3D Scene Understanding
IEEE transactions on multimedia (TMM), 2025
Haomiao Xiong
Yunzhi Zhuge
Jiawen Zhu
Lu Zhang
Huchuan Lu
238
11
0
14 Jan 2025
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks
Computer Vision and Pattern Recognition (CVPR), 2025
Miran Heo
Min-Hung Chen
De-An Huang
Sifei Liu
Subhashree Radhakrishnan
Seon Joo Kim
Yu-Chun Wang
Ryo Hachiuma
ObjD
VLM
529
9
0
14 Jan 2025
VidChain: Chain-of-Tasks with Metric-based Direct Preference Optimization for Dense Video Captioning
AAAI Conference on Artificial Intelligence (AAAI), 2025
Ji Soo Lee
Jongha Kim
Jeehye Na
Jinyoung Park
H. Kim
VGen
135
7
0
12 Jan 2025
Efficient Architectures for High Resolution Vision-Language Models
International Conference on Computational Linguistics (COLING), 2025
Miguel Carvalho
Bruno Martins
MLLM
VLM
199
1
0
05 Jan 2025
Classifier-Guided Captioning Across Modalities
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2025
Ariel Shaulov
Tal Shaharabany
E. Shaar
Gal Chechik
Lior Wolf
223
0
0
03 Jan 2025
Unleashing Text-to-Image Diffusion Prior for Zero-Shot Image Captioning
European Conference on Computer Vision (ECCV), 2024
Jianjie Luo
Jingwen Chen
Yehao Li
Yingwei Pan
Jianlin Feng
Hongyang Chao
Ting Yao
DiffM
VLM
287
2
0
03 Jan 2025
Hierarchical Banzhaf Interaction for General Video-Language Representation Learning
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024
Peng Jin
Haoyang Li
Li Yuan
Shuicheng Yan
Jie Chen
395
4
0
31 Dec 2024
A Comprehensive Survey of Large Language Models and Multimodal Large Language Models in Medicine
Information Fusion (Inf. Fusion), 2024
Hanguang Xiao
Feizhong Zhou
Xianglong Liu
Tianqi Liu
Zhipeng Li
Xin Liu
Xiaoxuan Huang
AILaw
LM&MA
LRM
449
82
0
31 Dec 2024
Multi-Agent Planning Using Visual Language Models
European Conference on Artificial Intelligence (ECAI), 2024
Michele Brienza
F. Argenziano
Vincenzo Suriani
D. Bloisi
Daniele Nardi
LM&Ro
LLMAG
265
6
0
31 Dec 2024
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
Computer Vision and Pattern Recognition (CVPR), 2024
Yuqian Yuan
Hang Zhang
Wentong Li
Zesen Cheng
Boqiang Zhang
...
Deli Zhao
Wenqiao Zhang
Yueting Zhuang
Jianke Zhu
Lidong Bing
422
39
0
31 Dec 2024
From Hallucinations to Facts: Enhancing Language Models with Curated Knowledge Graphs
Ratnesh Kumar Joshi
Sagnik Sengupta
Asif Ekbal
HILM
KELM
228
2
0
24 Dec 2024
SCBench: A Sports Commentary Benchmark for Video LLMs
Kuangzhi Ge
Lawrence Yunliang Chen
Kevin Zhang
Yulin Luo
Tianyu Shi
Liaoyuan Fan
Xiang Li
Guanqun Wang
Shanghang Zhang
230
3
0
23 Dec 2024
Where am I? Cross-View Geo-localization with Natural Language Descriptions
Junyan Ye
Honglin Lin
Leyan Ou
Dairong Chen
Zihao Wang
Bin Wang
Weijia Li
500
16
0
22 Dec 2024
A High-Quality Text-Rich Image Instruction Tuning Dataset via Hybrid Instruction Generation
International Conference on Computational Linguistics (COLING), 2024
Shijie Zhou
Ruiyi Zhang
Jiuxiang Gu
Changyou Chen
VLM
282
2
0
20 Dec 2024
G-VEval: A Versatile Metric for Evaluating Image and Video Captions Using GPT-4o
AAAI Conference on Artificial Intelligence (AAAI), 2024
Tony Cheng Tong
Sirui He
Z. Shao
Dit-Yan Yeung
276
17
0
18 Dec 2024
Query-centric Audio-Visual Cognition Network for Moment Retrieval, Segmentation and Step-Captioning
AAAI Conference on Artificial Intelligence (AAAI), 2024
Yunbin Tu
Liang-Sheng Li
Li Su
Qingming Huang
298
1
0
18 Dec 2024
Exploring Temporal Event Cues for Dense Video Captioning in Cyclic Co-learning
AAAI Conference on Artificial Intelligence (AAAI), 2024
Zhuyang Xie
Yan Yang
Yankai Yu
Jie Wang
Yongquan Jiang
Xiao-Jun Wu
406
2
0
16 Dec 2024
Learning to Merge Tokens via Decoupled Embedding for Efficient Vision Transformers
Neural Information Processing Systems (NeurIPS), 2024
Dong Hoon Lee
Seunghoon Hong
232
10
0
13 Dec 2024
Automated Image Captioning with CNNs and Transformers
Joshua Adrian Cahyono
Jeremy Nathan Jusuf
VLM
ViT
120
1
0
13 Dec 2024
NowYouSee Me: Context-Aware Automatic Audio Description
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2024
Seon-Ho Lee
Jue Wang
D. Fan
Zhikang Zhang
Linda Liu
Xiang Hao
Vimal Bhat
Xinyu Li
326
2
0
13 Dec 2024
Neptune: The Long Orbit to Benchmarking Long Video Understanding
Arsha Nagrani
Ruotong Wang
Ramin Mehran
Rachel Hornung
N. B. Gundavarapu
...
Boqing Gong
Cordelia Schmid
Mikhail Sirotenko
Yukun Zhu
Tobias Weyand
445
16
0
12 Dec 2024
Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition
Zhisheng Zhong
Chengyao Wang
Yuqi Liu
Senqiao Yang
Longxiang Tang
...
Shaozuo Yu
Sitong Wu
Eric Lo
Shu Liu
Jiaya Jia
AuLLM
287
18
0
12 Dec 2024
TimeRefine: Temporal Grounding with Time Refining Video LLM
Xizi Wang
Feng Cheng
Ziyang Wang
Huiyu Wang
Md. Mohaiminul Islam
Lorenzo Torresani
Joey Tianyi Zhou
Gedas Bertasius
David J. Crandall
490
6
0
12 Dec 2024
CoMA: Compositional Human Motion Generation with Multi-modal Agents
Shanlin Sun
Gabriel De Araujo
Jiaqi Xu
S. Kevin Zhou
Hanwen Zhang
Ziheng Huang
Chenyu You
Xiaohui Xie
427
13
0
10 Dec 2024
Learning to Correction: Explainable Feedback Generation for Visual Commonsense Reasoning Distractor
ACM Multimedia (MM), 2024
Jiali Chen
Xusen Hei
Yuqi Xue
Yuancheng Wei
Jiayuan Xie
Yi Cai
Qing Li
MLLM
LRM
323
11
0
08 Dec 2024
Who Brings the Frisbee: Probing Hidden Hallucination Factors in Large Vision-Language Model via Causality Analysis
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2024
Po-Hsuan Huang
Jeng-Lin Li
Chin-Po Chen
Ming-Ching Chang
Wei-Chao Chen
LRM
297
4
0
04 Dec 2024
Video LLMs for Temporal Reasoning in Long Videos
Fawad Javed Fateh
Umer Ahmed
Hamza Khan
M. Zia
Quoc-Huy Tran
VLM
658
6
0
04 Dec 2024
DIR: Retrieval-Augmented Image Captioning with Comprehensive Understanding
Hao Wu
Zhihang Zhong
Xiao Sun
DiffM
305
1
0
02 Dec 2024
OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows
Computer Vision and Pattern Recognition (CVPR), 2024
Shufan Li
Konstantinos Kallidromitis
Akash Gokul
Zichun Liao
Yusuke Kato
Kazuki Kozuka
Aditya Grover
VGen
451
25
0
02 Dec 2024
LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences
Computer Vision and Pattern Recognition (CVPR), 2024
Hongyan Zhi
Peihao Chen
Junyan Li
Shuailei Ma
Xinyu Sun
Tianhang Xiang
Yinjie Lei
Mingkui Tan
Chuang Gan
432
25
0
02 Dec 2024
DOGR: Towards Versatile Visual Document Grounding and Referring
Yinan Zhou
Yuxin Chen
Haokun Lin
Shuyu Yang
Li Zhu
Chen Ma
Mingyu Ding
Ying Shan
ObjD
553
4
0
26 Nov 2024
Diagram-Driven Course Questions Generation
Xinyu Zhang
L. Zhang
Yanrui Wu
Muye Huang
Wenjun Wu
Bo Li
Shaowei Wang
Jun Liu
429
0
0
26 Nov 2024
TechCoach: Towards Technical-Point-Aware Descriptive Action Coaching
Yuan-Ming Li
An-Lan Wang
Kun-Yu Lin
Yu-Ming Tang
Ling-an Zeng
Jian-Fang Hu
Wei-Shi Zheng
542
6
0
26 Nov 2024
VideoOrion: Tokenizing Object Dynamics in Videos
Yicheng Feng
Yijiang Li
Wanpeng Zhang
Sipeng Zheng
Zongqing Lu
406
7
0
25 Nov 2024
IterIS: Iterative Inference-Solving Alignment for LoRA Merging
Computer Vision and Pattern Recognition (CVPR), 2024
Hongxu Chen
Runshi Li
Bowei Zhu
Zhen Wang
Long Chen
MoMe
432
2
0
21 Nov 2024
LaVida Drive: Vision-Text Interaction VLM for Autonomous Driving with Token Selection, Recovery and Enhancement
Siwen Jiao
Yangyi Fang
Baoyun Peng
Wangqun Chen
Bharadwaj Veeravalli
470
11
0
20 Nov 2024
The Power of Many: Multi-Agent Multimodal Models for Cultural Image Captioning
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Longju Bai
Angana Borah
Oana Ignat
Amélie Reymond
VLM
321
6
0
18 Nov 2024
SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization
Computer Vision and Pattern Recognition (CVPR), 2024
Hongrui Jia
Chaoya Jiang
Haiyang Xu
Wei Ye
Mengfan Dong
Ming Yan
Ji Zhang
Fei Huang
Shikun Zhang
MLLM
392
7
0
17 Nov 2024
Unstructured Text Enhanced Open-domain Dialogue System: A Systematic Survey
Longxuan Ma
Mingda Li
Weinan Zhang
Jiapeng Li
Ting Liu
349
19
0
14 Nov 2024
Which Viewpoint Shows it Best? Language for Weakly Supervising View Selection in Multi-view Instructional Videos
Computer Vision and Pattern Recognition (CVPR), 2024
Sagnik Majumder
Tushar Nagarajan
Ziad Al-Halah
Reina Pradhan
Kristen Grauman
424
0
0
13 Nov 2024
Grounded Video Caption Generation
Evangelos Kazakos
Cordelia Schmid
Josef Sivic
270
0
0
12 Nov 2024
Multi-Modal interpretable automatic video captioning
Antoine Hanna-Asaad
Decky Aspandi
Titus Zaharia
255
1
0
11 Nov 2024
StoryTeller: Improving Long Video Description through Global Audio-Visual Character Identification
Yichen He
Yuan Lin
Jianchao Wu
Hanchong Zhang
Yuchen Zhang
Ruicheng Le
VGen
VLM
782
5
0
11 Nov 2024
EVQAScore: A Fine-grained Metric for Video Question Answering Data Quality Evaluation
Hao Liang
Zirong Chen
Feiyu Xiong
Wentao Zhang
312
0
0
11 Nov 2024
ViTOC: Vision Transformer and Object-aware Captioner
Feiyang Huang
391
2
0
09 Nov 2024