ResearchTrend.AI
End-to-end Generative Pretraining for Multimodal Video Captioning

20 January 2022
Paul Hongsuck Seo
Arsha Nagrani
Anurag Arnab
Cordelia Schmid
arXiv: 2201.08264

Papers citing "End-to-end Generative Pretraining for Multimodal Video Captioning"

50 / 104 papers shown
Title
FocusedAD: Character-centric Movie Audio Description
Xiaojun Ye
C. Wang
Yiren Song
Sheng Zhou
Liangcheng Li
Jiajun Bu
VGen
51
0
0
16 Apr 2025
Extending Visual Dynamics for Video-to-Music Generation
Xiaohao Liu
Teng Tu
Yunshan Ma
Tat-Seng Chua
VGen
59
0
0
10 Apr 2025
Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs
Lucas Ventura
Antoine Yang
Cordelia Schmid
Gül Varol
28
0
0
31 Mar 2025
Learning to Generate Long-term Future Narrations Describing Activities of Daily Living
Ramanathan Rajendiran
Debaditya Roy
Basura Fernando
VGen
41
0
0
03 Mar 2025
Parameter-free Video Segmentation for Vision and Language Understanding
Louis Mahon
Mirella Lapata
VLM
35
1
0
03 Mar 2025
Fine-Grained Video Captioning through Scene Graph Consolidation
Sanghyeok Chu
Seonguk Seo
Bohyung Han
48
1
0
23 Feb 2025
Hierarchical Banzhaf Interaction for General Video-Language Representation Learning
Peng Jin
H. Li
Li Yuan
Shuicheng Yan
Jie Chen
45
1
0
31 Dec 2024
Explainable and Interpretable Multimodal Large Language Models: A Comprehensive Survey
Yunkai Dang
Kaichen Huang
Jiahao Huo
Yibo Yan
S. Huang
...
Kun Wang
Yong Liu
Jing Shao
Hui Xiong
Xuming Hu
LRM
96
14
0
03 Dec 2024
HyperGLM: HyperGraph for Video Scene Graph Generation and Anticipation
Trong-Thuan Nguyen
Pha Nguyen
J. Cothren
Alper Yilmaz
Khoa Luu
80
1
0
27 Nov 2024
TechCoach: Towards Technical-Point-Aware Descriptive Action Coaching
Yuan-Ming Li
An-Lan Wang
Kun-Yu Lin
Yu-Ming Tang
Ling-an Zeng
Jian-Fang Hu
Wei-Shi Zheng
93
6
0
26 Nov 2024
Masked Differential Privacy
David Schneider
Sina Sajadmanesh
Vikash Sehwag
Saquib Sarfraz
Rainer Stiefelhagen
Lingjuan Lyu
Vivek Sharma
28
1
0
22 Oct 2024
It's Just Another Day: Unique Video Captioning by Discriminative Prompting
Toby Perrett
Tengda Han
Dima Damen
Andrew Zisserman
19
3
0
15 Oct 2024
GEM-VPC: A dual Graph-Enhanced Multimodal integration for Video Paragraph Captioning
Eileen Wang
Caren Han
Josiah Poon
19
0
0
12 Oct 2024
Ego-VPA: Egocentric Video Understanding with Parameter-efficient Adaptation
Tz-Ying Wu
Kyle Min
Subarna Tripathi
Nuno Vasconcelos
EgoV
53
0
0
28 Jul 2024
AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description
Junyu Xie
Tengda Han
Max Bain
Arsha Nagrani
Gül Varol
Weidi Xie
Andrew Zisserman
VGen
25
8
0
22 Jul 2024
Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning
Thong Nguyen
Yi Bin
Xiaobao Wu
Xinshuai Dong
Zhiyuan Hu
Khoi M. Le
Cong-Duy Nguyen
See-Kiong Ng
Luu Anh Tuan
37
5
0
04 Jul 2024
Video Watermarking: Safeguarding Your Video from (Unauthorized) Annotations by Video-based LLMs
Jinmin Li
Kuofeng Gao
Yang Bai
Jingyun Zhang
Shu-Tao Xia
28
4
0
02 Jul 2024
Enhancing Video-Language Representations with Structural Spatio-Temporal Alignment
Hao Fei
Shengqiong Wu
Meishan Zhang
M. Zhang
Tat-Seng Chua
Shuicheng Yan
AI4TS
34
37
0
27 Jun 2024
Towards Holistic Language-video Representation: the language model-enhanced MSR-Video to Text Dataset
Yuchen Yang
Yingxuan Duan
VGen
23
0
0
19 Jun 2024
GUI Action Narrator: Where and When Did That Action Take Place?
Qinchen Wu
Difei Gao
Kevin Qinghong Lin
Zhuoyu Wu
Xiangwu Guo
Peiran Li
Weichen Zhang
Hengxu Wang
Mike Zheng Shou
29
3
0
19 Jun 2024
Labeling Comic Mischief Content in Online Videos with a Multimodal Hierarchical-Cross-Attention Model
Elaheh Baharlouei
Mahsa Shafaei
Yigeng Zhang
Hugo Jair Escalante
Thamar Solorio
31
0
0
12 Jun 2024
MICap: A Unified Model for Identity-aware Movie Descriptions
Haran Raajesh
Naveen Reddy Desanur
Zeeshan Khan
Makarand Tapaswi
23
4
0
19 May 2024
SoccerNet-Echoes: A Soccer Game Audio Commentary Dataset
Sushant Gautam
Mehdi Houshmand Sarkhoosh
Jan Held
Cise Midoglu
A. Cioppa
Silvio Giancola
Vajira Thambawita
Michael A. Riegler
P. Halvorsen
Mubarak Shah
21
4
0
12 May 2024
Learning text-to-video retrieval from image captioning
Lucas Ventura
Cordelia Schmid
Gül Varol
3DV
31
3
0
26 Apr 2024
AutoAD III: The Prequel -- Back to the Pixels
Tengda Han
Max Bain
Arsha Nagrani
Gül Varol
Weidi Xie
Andrew Zisserman
VGen
DiffM
36
20
0
22 Apr 2024
Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval
Minkuk Kim
Hyeon Bae Kim
Jinyoung Moon
Jinwoo Choi
Seong Tae Kim
32
16
0
11 Apr 2024
MoReVQA: Exploring Modular Reasoning Models for Video Question Answering
Juhong Min
Shyamal Buch
Arsha Nagrani
Minsu Cho
Cordelia Schmid
LRM
34
20
0
09 Apr 2024
FMM-Attack: A Flow-based Multi-modal Adversarial Attack on Video-based LLMs
Jinmin Li
Kuofeng Gao
Yang Bai
Jingyun Zhang
Shu-Tao Xia
Yisen Wang
AAML
22
7
0
20 Mar 2024
A Comprehensive Survey of 3D Dense Captioning: Localizing and Describing Objects in 3D Scenes
Ting Yu
Xiaojun Lin
Shuhui Wang
Weiguo Sheng
Qingming Huang
Jun-chen Yu
3DV
30
10
0
12 Mar 2024
Beyond MOT: Semantic Multi-Object Tracking
Yunhao Li
Hao Wang
Xue Ma
Jiali Yao
Shaohua Dong
Heng Fan
Libo Zhang
VOT
21
3
0
08 Mar 2024
Tri-Modal Motion Retrieval by Learning a Joint Embedding Space
Kangning Yin
Shihao Zou
Yuxuan Ge
Zheng Tian
27
5
0
01 Mar 2024
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
Tsai-Shien Chen
Aliaksandr Siarohin
Willi Menapace
Ekaterina Deyneka
Hsiang-wei Chao
...
Yuwei Fang
Hsin-Ying Lee
Jian Ren
Ming-Hsuan Yang
Sergey Tulyakov
VGen
70
177
0
29 Feb 2024
Video ReCap: Recursive Captioning of Hour-Long Videos
Md. Mohaiminul Islam
Ngan Ho
Xitong Yang
Tushar Nagarajan
Lorenzo Torresani
Gedas Bertasius
VGen
VLM
16
44
0
20 Feb 2024
Multi-modal News Understanding with Professionally Labelled Videos (ReutersViLNews)
Shih-Han Chou
Matthew Kowal
Yasmin Niknam
Diana Moyano
Shayaan Mehdi
...
Cheng Zhang
Ian Knopke
S. Kocak
Leonid Sigal
Yalda Mohsenzadeh
19
1
0
23 Jan 2024
SnapCap: Efficient Snapshot Compressive Video Captioning
Jianqiao Sun
Yudi Su
Hao Zhang
Ziheng Cheng
Zequn Zeng
Zhengjue Wang
Bo Chen
Xin Yuan
22
1
0
10 Jan 2024
Context-Guided Spatio-Temporal Video Grounding
Xin Gu
Hengrui Fan
Yan Huang
Tiejian Luo
Libo Zhang
26
13
0
03 Jan 2024
Subject-Oriented Video Captioning
Yunchuan Ma
Chang Teng
Yuankai Qi
Guorong Li
Laiyun Qing
Qi Wu
Qingming Huang
22
0
0
20 Dec 2023
Video Summarization: Towards Entity-Aware Captions
Hammad A. Ayyubi
Tianqi Liu
Arsha Nagrani
Xudong Lin
Mingda Zhang
Anurag Arnab
Feng Han
Yukun Zhu
Jialu Liu
Shih-Fu Chang
26
1
0
01 Dec 2023
RTQ: Rethinking Video-language Understanding Based on Image-text Model
Xiao Wang
Yaoyu Li
Tian Gan
Zheng Zhang
Jingjing Lv
Liqiang Nie
11
6
0
01 Dec 2023
CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation
Zineng Tang
Ziyi Yang
Mahmoud Khademi
Yang Liu
Chenguang Zhu
Mohit Bansal
LRM
MLLM
AuLLM
52
44
0
30 Nov 2023
ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in Video-Language Models
İlker Kesen
Andrea Pedrotti
Mustafa Dogan
Michele Cafagna
Emre Can Acikgoz
...
Iacer Calixto
Anette Frank
Albert Gatt
Aykut Erdem
Erkut Erdem
33
15
0
13 Nov 2023
CAD -- Contextual Multi-modal Alignment for Dynamic AVQA
Asmar Nadeem
Adrian Hilton
R. Dawes
Graham A. Thomas
A. Mustafa
16
9
0
25 Oct 2023
Uncovering Hidden Connections: Iterative Search and Reasoning for Video-grounded Dialog
Haoyu Zhang
Meng Liu
Yaowei Wang
Da Cao
Weili Guan
Liqiang Nie
28
0
0
11 Oct 2023
AutoAD II: The Sequel -- Who, When, and What in Movie Audio Description
Tengda Han
Max Bain
Arsha Nagrani
Gül Varol
Weidi Xie
Andrew Zisserman
VGen
DiffM
19
36
0
10 Oct 2023
HowToCaption: Prompting LLMs to Transform Video Annotations at Scale
Nina Shvetsova
Anna Kukleva
Xudong Hong
Christian Rupprecht
Bernt Schiele
Hilde Kuehne
32
25
0
07 Oct 2023
A Hierarchical Graph-based Approach for Recognition and Description Generation of Bimanual Actions in Videos
Fatemeh Ziaeetabar
Reza Safabakhsh
S. Momtazi
M. Tamosiunaite
F. Worgotter
17
1
0
01 Oct 2023
VidChapters-7M: Video Chapters at Scale
Antoine Yang
Arsha Nagrani
Ivan Laptev
Josef Sivic
Cordelia Schmid
VGen
13
26
0
25 Sep 2023
Accurate and Fast Compressed Video Captioning
Yaojie Shen
Xin Gu
Kai Xu
Hengrui Fan
Longyin Wen
Libo Zhang
ViT
18
26
0
22 Sep 2023
Collaborative Three-Stream Transformers for Video Captioning
Hao Wang
Libo Zhang
Hengrui Fan
Tiejian Luo
21
6
0
18 Sep 2023
MultiCapCLIP: Auto-Encoding Prompts for Zero-Shot Multilingual Visual Captioning
Bang-ju Yang
Fenglin Liu
X. Wu
Yaowei Wang
Xu Sun
Yuexian Zou
VLM
CLIP
22
13
0
25 Aug 2023