2003.07758
Multi-modal Dense Video Captioning
17 March 2020
Vladimir E. Iashin
Esa Rahtu
Github (143★)
Papers citing "Multi-modal Dense Video Captioning"
50 / 101 papers shown
Enrich and Detect: Video Temporal Grounding with Multimodal LLMs
Shraman Pramanick
E. Mavroudi
Yale Song
Rama Chellappa
Lorenzo Torresani
Triantafyllos Afouras
277
3
0
19 Oct 2025
Shot2Tactic-Caption: Multi-Scale Captioning of Badminton Videos for Tactical Understanding
Ning Ding
Keisuke Fujii
Toru Tamaki
149
1
0
16 Oct 2025
Image-to-Video Transfer Learning based on Image-Language Foundation Models: A Comprehensive Survey
Jinxuan Li
Chaolei Tan
Haoxuan Chen
Jianxin Ma
Jian-Fang Hu
Wei-Shi Zheng
Jianhuang Lai
VLM
250
1
0
12 Oct 2025
MomentSeg: Moment-Centric Sampling for Enhanced Video Pixel Understanding
Ming Dai
Sen Yang
Boqiang Duan
Wankou Yang
Jingdong Wang
VOS
366
0
0
10 Oct 2025
Sali4Vid: Saliency-Aware Video Reweighting and Adaptive Caption Retrieval for Dense Video Captioning
MinJu Jeon
Si-Woo Kim
Ye-Chan Kim
HyunGee Kim
Dong-Jin Kim
VGen
193
3
0
04 Sep 2025
A Survey on Video Temporal Grounding with Multimodal Large Language Model
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2025
Yue Yu
Wei Liu
Y. Liu
Meng-yang Liu
Liqiang Nie
Zhouchen Lin
C. Chen
AI4TS
VLM
LRM
175
13
0
07 Aug 2025
Toward Scalable Video Narration: A Training-free Approach Using Multimodal Large Language Models
Tz-Ying Wu
Tahani Trigui
S. N. Sridhar
Anand Bodas
Subarna Tripathi
133
2
0
22 Jul 2025
ESG-Net: Event-Aware Semantic Guided Network for Dense Audio-Visual Event Localization
Huilai Li
Yonghao Dang
Ying Xing
Yiming Wang
Jianqin Yin
239
0
0
14 Jul 2025
PR-DETR: Injecting Position and Relation Prior for Dense Video Captioning
Yizhe Li
Sanping Zhou
Zheng Qin
Le Wang
ViT
261
0
0
19 Jun 2025
ARGUS: Hallucination and Omission Evaluation in Video-LLMs
Ruchit Rawal
Reza Shirkavand
Heng-Chiao Huang
Gowthami Somepalli
Tom Goldstein
381
7
0
09 Jun 2025
ViSMaP: Unsupervised Hour-long Video Summarisation by Meta-Prompting
Jian Hu
Dimitrios Korkinof
S. Gong
Mariano Beguerisse-Díaz
VLM
265
0
0
22 Apr 2025
Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting
Yunlong Tang
Jing Bi
Chao Huang
Susan Liang
Daiki Shimada
...
Jinxi He
Liu He
Zeliang Zhang
Jiebo Luo
Chenliang Xu
370
10
0
07 Apr 2025
Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation
Junyu Xie
Tengda Han
Max Bain
Arsha Nagrani
Eshika Khandelwal
Gül Varol
Weidi Xie
Andrew Zisserman
DiffM
VGen
490
6
0
01 Apr 2025
Fair Dynamic Spectrum Access via Fully Decentralized Multi-Agent Reinforcement Learning
International Symposium on Modeling and Optimization in Mobile, Ad-Hoc and Wireless Networks (WiOpt), 2025
Yubo Zhang
Pedro Botelho
Trevor Gordon
Gil Zussman
I. Kadota
340
1
0
31 Mar 2025
LION-FS: Fast & Slow Video-Language Thinker as Online Video Assistant
Computer Vision and Pattern Recognition (CVPR), 2025
Wei Li
Bing Hu
Rui Shao
Leyang Shen
Liqiang Nie
375
45
0
05 Mar 2025
Exploring Temporal Event Cues for Dense Video Captioning in Cyclic Co-learning
AAAI Conference on Artificial Intelligence (AAAI), 2024
Zhuyang Xie
Yan Yang
Yankai Yu
Jie Wang
Yongquan Jiang
Xiao-Jun Wu
526
5
0
16 Dec 2024
NowYouSee Me: Context-Aware Automatic Audio Description
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2024
Seon-Ho Lee
Jue Wang
D. Fan
Zhikang Zhang
Linda Liu
Xiang Hao
Vimal Bhat
Xinyu Li
353
2
0
13 Dec 2024
Multi-Modal interpretable automatic video captioning
Antoine Hanna-Asaad
Decky Aspandi
Titus Zaharia
282
1
0
11 Nov 2024
SaSR-Net: Source-Aware Semantic Representation Network for Enhancing Audio-Visual Question Answering
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Tianyu Yang
Yiyang Nan
Lisen Dai
Zhenwen Liang
Yapeng Tian
Wei Wei
405
2
0
07 Nov 2024
GEM-VPC: A dual Graph-Enhanced Multimodal integration for Video Paragraph Captioning
Eileen Wang
Caren Han
Josiah Poon
272
1
0
12 Oct 2024
Investigating Representation Universality: Case Study on Genealogical Representations
David D. Baek
Yuxiao Li
Max Tegmark
356
3
0
10 Oct 2024
Dissecting Temporal Understanding in Text-to-Audio Retrieval
ACM Multimedia (MM), 2024
Andreea-Maria Oncescu
João F. Henriques
A. Sophia Koepke
372
5
0
01 Sep 2024
Attend-Fusion: Efficient Audio-Visual Fusion for Video Classification
Mahrukh Awan
Asmar Nadeem
Muhammad Junaid Awan
Armin Mustafa
Syed Sameed Husain
435
5
0
26 Aug 2024
AVCap: Leveraging Audio-Visual Features as Text Tokens for Captioning
Jongsuk Kim
Jiwon Shin
Junmo Kim
511
7
0
10 Jul 2024
Live Video Captioning
Eduardo Blanco-Fernández
Carlos Gutiérrez-Álvarez
Nadia Nasri
Saturnino Maldonado-Bascón
Roberto J. López-Sastre
355
3
0
20 Jun 2024
NarrativeBridge: Enhancing Video Captioning with Causal-Temporal Narrative
International Conference on Learning Representations (ICLR), 2024
Asmar Nadeem
Faegheh Sardari
R. Dawes
Syed Sameed Husain
Adrian Hilton
Armin Mustafa
513
10
0
10 Jun 2024
From CNNs to Transformers in Multimodal Human Action Recognition: A Survey
Muhammad Bilal Shaikh
Syed Mohammed Shamsul Islam
Douglas Chai
Naveed Akhtar
439
37
0
22 May 2024
SoccerNet-Echoes: A Soccer Game Audio Commentary Dataset
Sushant Gautam
Mehdi Houshmand Sarkhoosh
Jan Held
Cise Midoglu
A. Cioppa
Silvio Giancola
Vajira Thambawita
Michael A. Riegler
Pål Halvorsen
Mubarak Shah
281
12
0
12 May 2024
AutoAD III: The Prequel -- Back to the Pixels
Tengda Han
Max Bain
Arsha Nagrani
Gül Varol
Weidi Xie
Andrew Zisserman
VGen
DiffM
445
40
0
22 Apr 2024
Enhancing Traffic Safety with Parallel Dense Video Captioning for End-to-End Event Analysis
Maged Shoman
Dongdong Wang
Armstrong Aboah
Mohamed Abdel-Aty
239
20
0
12 Apr 2024
Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval
Minkuk Kim
Hyeon Bae Kim
Jinyoung Moon
Jinwoo Choi
Seong Tae Kim
218
48
0
11 Apr 2024
DIBS: Enhancing Dense Video Captioning with Unlabeled Videos via Pseudo Boundary Enrichment and Online Refinement
Computer Vision and Pattern Recognition (CVPR), 2024
Hao Wu
Huabin Liu
Yu Qiao
Xiao Sun
3DV
142
21
0
03 Apr 2024
Streaming Dense Video Captioning
Xingyi Zhou
Anurag Arnab
Shyamal Buch
Shen Yan
Austin Myers
Xuehan Xiong
Arsha Nagrani
Cordelia Schmid
VLM
292
87
0
01 Apr 2024
Towards Multimodal Video Paragraph Captioning Models Robust to Missing Modality
Sishuo Chen
Lei Li
Shuhuai Ren
Rundong Gao
Yuanxin Liu
Xiaohan Bi
Xu Sun
Lu Hou
272
3
0
28 Mar 2024
OmniVid: A Generative Framework for Universal Video Understanding
Junke Wang
Dongdong Chen
Chong Luo
Bo He
Lu Yuan
Zuxuan Wu
Yu-Gang Jiang
VLM
VGen
344
37
0
26 Mar 2024
A Comprehensive Survey of 3D Dense Captioning: Localizing and Describing Objects in 3D Scenes
Ting Yu
Xiaojun Lin
Shuhui Wang
Weiguo Sheng
Qingming Huang
Jun-chen Yu
3DV
254
19
0
12 Mar 2024
VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models
Neural Information Processing Systems (NeurIPS), 2024
Wenhao Wang
Yi Yang
VGen
DiffM
542
91
0
10 Mar 2024
ADL4D: Towards A Contextually Rich Dataset for 4D Activities of Daily Living
Marsil Zakour
Partha Partim Nath
Ludwig Lohmer
Emre Faik Gökçe
Martin Piccolrovazzi
Constantin Patsch
Yuankai Wu
Rahul P. Chaudhari
Eckehard G. Steinbach
276
2
0
27 Feb 2024
Multi-modal News Understanding with Professionally Labelled Videos (ReutersViLNews)
Shih-Han Chou
Matthew Kowal
Yasmin Niknam
Diana Moyano
Shayaan Mehdi
...
Cheng Zhang
Ian Knopke
S. Kocak
Leonid Sigal
Yalda Mohsenzadeh
390
2
0
23 Jan 2024
FunnyNet-W: Multimodal Learning of Funny Moments in Videos in the Wild
International Journal of Computer Vision (IJCV), 2024
Zhi-Song Liu
Robin Courant
Vicky Kalogeiton
404
11
0
08 Jan 2024
Context-Guided Spatio-Temporal Video Grounding
Computer Vision and Pattern Recognition (CVPR), 2024
Xin Gu
Hengrui Fan
Yan Huang
Tiejian Luo
Libo Zhang
388
43
0
03 Jan 2024
Leveraging Generative Language Models for Weakly Supervised Sentence Component Analysis in Video-Language Joint Learning
Zaber Ibn Abdul Hakim
Najibul Haque Sarker
Rahul Pratap Singh
Bishmoy Paul
Ali Dabouei
Min Xu
385
1
0
10 Dec 2023
CLearViD: Curriculum Learning for Video Description
Cheng-Yu Chuang
Pooyan Fazli
238
1
0
08 Nov 2023
Dense Video Captioning: A Survey of Techniques, Datasets and Evaluation Protocols
ACM Computing Surveys (ACM Comput. Surv.), 2023
Iqra Qasim
Alexander Horsch
Dilip K. Prasad
291
20
0
05 Nov 2023
From Image to Language: A Critical Analysis of Visual Question Answering (VQA) Approaches, Challenges, and Opportunities
Information Fusion (Inf. Fusion), 2023
Md Farhan Ishmam
Md Sakib Hossain Shovon
M. F. Mridha
Nilanjan Dey
440
82
0
01 Nov 2023
CAD -- Contextual Multi-modal Alignment for Dynamic AVQA
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2023
Asmar Nadeem
Adrian Hilton
R. Dawes
Graham A. Thomas
A. Mustafa
384
14
0
25 Oct 2023
VidCoM: Fast Video Comprehension through Large Language Models with Multimodal Tools
Huihui Gong
Minjing Dong
Siqi Ma
S. Çamtepe
Chang Xu
Lei Hou
Surya Nepal
VLM
MLLM
325
0
0
16 Oct 2023
AutoAD II: The Sequel -- Who, When, and What in Movie Audio Description
IEEE International Conference on Computer Vision (ICCV), 2023
Tengda Han
Max Bain
Arsha Nagrani
Gül Varol
Weidi Xie
Andrew Zisserman
VGen
DiffM
323
54
0
10 Oct 2023
A Hierarchical Graph-based Approach for Recognition and Description Generation of Bimanual Actions in Videos
Fatemeh Ziaeetabar
Reza Safabakhsh
S. Momtazi
M. Tamosiunaite
Florentin Wörgötter
316
8
0
01 Oct 2023
Skip-Plan: Procedure Planning in Instructional Videos via Condensed Action Space Learning
IEEE International Conference on Computer Vision (ICCV), 2023
Zhiheng Li
Wenjia Geng
Muheng Li
Lei Chen
Yansong Tang
Jiwen Lu
Jie Zhou
226
16
0
01 Oct 2023