ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2201.08264
  4. Cited By
End-to-end Generative Pretraining for Multimodal Video Captioning

End-to-end Generative Pretraining for Multimodal Video Captioning

20 January 2022
Paul Hongsuck Seo
Arsha Nagrani
Anurag Arnab
Cordelia Schmid
ArXivPDFHTML

Papers citing "End-to-end Generative Pretraining for Multimodal Video Captioning"

50 / 104 papers shown
Title
ViCo: Engaging Video Comment Generation with Human Preference Rewards
ViCo: Engaging Video Comment Generation with Human Preference Rewards
Yuchong Sun
Bei Liu
Xu Chen
Ruihua Song
Jianlong Fu
VGen
20
2
0
22 Aug 2023
UnIVAL: Unified Model for Image, Video, Audio and Language Tasks
UnIVAL: Unified Model for Image, Video, Audio and Language Tasks
Mustafa Shukor
Corentin Dancette
Alexandre Ramé
Matthieu Cord
MoMe
MLLM
27
42
0
30 Jul 2023
One-Versus-Others Attention: Scalable Multimodal Integration for
  Clinical Data
One-Versus-Others Attention: Scalable Multimodal Integration for Clinical Data
Michal Golovanevsky
Eva Schiller
Akira Nair
Ritambhara Singh
Carsten Eickhoff
14
2
0
11 Jul 2023
Exploring the Role of Audio in Video Captioning
Exploring the Role of Audio in Video Captioning
Yuhan Shen
Linjie Yang
Longyin Wen
Haichao Yu
Ehsan Elhamifar
Heng Wang
13
2
0
21 Jun 2023
Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen
  Large Language Models
Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models
Junting Pan
Ziyi Lin
Yuying Ge
Xiatian Zhu
Renrui Zhang
Yi Wang
Yu Qiao
Hongsheng Li
MLLM
24
26
0
15 Jun 2023
PromptStyle: Controllable Style Transfer for Text-to-Speech with Natural
  Language Descriptions
PromptStyle: Controllable Style Transfer for Text-to-Speech with Natural Language Descriptions
Guanghou Liu
Yongmao Zhang
Yinjiao Lei
Yunlin Chen
Rui Wang
Zhifei Li
Linfu Xie
8
36
0
31 May 2023
PaLI-X: On Scaling up a Multilingual Vision and Language Model
PaLI-X: On Scaling up a Multilingual Vision and Language Model
Xi Chen
Josip Djolonga
Piotr Padlewski
Basil Mustafa
Soravit Changpinyo
...
Mojtaba Seyedhosseini
A. Angelova
Xiaohua Zhai
N. Houlsby
Radu Soricut
VLM
44
186
0
29 May 2023
VLAB: Enhancing Video Language Pre-training by Feature Adapting and
  Blending
VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending
Xingjian He
Sihan Chen
Fan Ma
Zhicheng Huang
Xiaojie Jin
Zikang Liu
Dongmei Fu
Yi Yang
J. Liu
Jiashi Feng
VLM
CLIP
18
17
0
22 May 2023
Gloss-Free End-to-End Sign Language Translation
Gloss-Free End-to-End Sign Language Translation
Kezhou Lin
Xiaohan Wang
Linchao Zhu
Ke Sun
Bang Zhang
Yezhou Yang
SLR
20
17
0
22 May 2023
InstructVid2Vid: Controllable Video Editing with Natural Language
  Instructions
InstructVid2Vid: Controllable Video Editing with Natural Language Instructions
Bosheng Qin
Juncheng Li
Siliang Tang
Tat-Seng Chua
Yueting Zhuang
VGen
DiffM
15
16
0
21 May 2023
Any-to-Any Generation via Composable Diffusion
Any-to-Any Generation via Composable Diffusion
Zineng Tang
Ziyi Yang
Chenguang Zhu
Michael Zeng
Mohit Bansal
VGen
DiffM
18
169
0
19 May 2023
Generative Pre-trained Transformer: A Comprehensive Review on Enabling
  Technologies, Potential Applications, Emerging Challenges, and Future
  Directions
Generative Pre-trained Transformer: A Comprehensive Review on Enabling Technologies, Potential Applications, Emerging Challenges, and Future Directions
Gokul Yenduri
M. Ramalingam
G. C. Selvi
Y. Supriya
Gautam Srivastava
...
Rutvij H. Jhaveri
B. Prabadevi
Weizheng Wang
Athanasios V. Vasilakos
Thippa Reddy Gadekallu
AI4CE
LM&MA
8
158
0
11 May 2023
VideoOFA: Two-Stage Pre-Training for Video-to-Text Generation
VideoOFA: Two-Stage Pre-Training for Video-to-Text Generation
Xilun Chen
L. Yu
Wenhan Xiong
Barlas Ouguz
Yashar Mehdad
Wen-tau Yih
VGen
26
3
0
04 May 2023
MH-DETR: Video Moment and Highlight Detection with Cross-modal
  Transformer
MH-DETR: Video Moment and Highlight Detection with Cross-modal Transformer
Yifang Xu
Yunzhuo Sun
Yang Li
Yilei Shi
Xiaoxia Zhu
S. Du
ViT
35
33
0
29 Apr 2023
A Review of Deep Learning for Video Captioning
A Review of Deep Learning for Video Captioning
Moloud Abdar
Meenakshi Kollati
Swaraja Kuraparthi
Farhad Pourpanah
Daniel J. McDuff
...
Shuicheng Yan
Abduallah A. Mohamed
Abbas Khosravi
Erik Cambria
Fatih Porikli
3DV
17
20
0
22 Apr 2023
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset
Sihan Chen
Xingjian He
Longteng Guo
Xinxin Zhu
Weining Wang
Jinhui Tang
Jinhui Tang
VLM
26
99
0
17 Apr 2023
Verbs in Action: Improving verb understanding in video-language models
Verbs in Action: Improving verb understanding in video-language models
Liliane Momeni
Mathilde Caron
Arsha Nagrani
Andrew Zisserman
Cordelia Schmid
30
69
0
13 Apr 2023
SoccerNet-Caption: Dense Video Captioning for Soccer Broadcasts
  Commentaries
SoccerNet-Caption: Dense Video Captioning for Soccer Broadcasts Commentaries
Hassan Mkhallati
A. Cioppa
Silvio Giancola
Bernard Ghanem
Marc Van Droogenbroeck
22
32
0
10 Apr 2023
Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions
Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions
Jun Chen
Deyao Zhu
Kilichbek Haydarov
Xiang Li
Mohamed Elhoseiny
20
37
0
09 Apr 2023
Scalable and Accurate Self-supervised Multimodal Representation Learning
  without Aligned Video and Text Data
Scalable and Accurate Self-supervised Multimodal Representation Learning without Aligned Video and Text Data
Vladislav Lialin
Stephen Rawls
David M. Chan
Shalini Ghosh
Anna Rumshisky
Wael Hamza
VLM
AI4TS
17
6
0
04 Apr 2023
Procedure-Aware Pretraining for Instructional Video Understanding
Procedure-Aware Pretraining for Instructional Video Understanding
Honglu Zhou
Roberto Martín-Martín
Mubbasir Kapadia
Silvio Savarese
Juan Carlos Niebles
23
38
0
31 Mar 2023
Self-Supervised Multimodal Learning: A Survey
Self-Supervised Multimodal Learning: A Survey
Yongshuo Zong
Oisin Mac Aodha
Timothy M. Hospedales
SSL
16
43
0
31 Mar 2023
AutoAD: Movie Description in Context
AutoAD: Movie Description in Context
Tengda Han
Max Bain
Arsha Nagrani
Gül Varol
Weidi Xie
Andrew Zisserman
VGen
14
34
0
29 Mar 2023
MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks
MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks
Weicheng Kuo
A. Piergiovanni
Dahun Kim
Xiyang Luo
Benjamin Caine
...
Luowei Zhou
Andrew M. Dai
Zhifeng Chen
Claire Cui
A. Angelova
MLLM
VLM
17
23
0
29 Mar 2023
Language-Guided Audio-Visual Source Separation via Trimodal Consistency
Language-Guided Audio-Visual Source Separation via Trimodal Consistency
Reuben Tan
Arijit Ray
Andrea Burns
Bryan A. Plummer
Justin Salamon
Oriol Nieto
Bryan C. Russell
Kate Saenko
15
19
0
28 Mar 2023
SEM-POS: Grammatically and Semantically Correct Video Captioning
SEM-POS: Grammatically and Semantically Correct Video Captioning
Asmar Nadeem
A. Hilton
R. Dawes
Graham A. Thomas
A. Mustafa
19
8
0
26 Mar 2023
GOAL: A Challenging Knowledge-grounded Video Captioning Benchmark for
  Real-time Soccer Commentary Generation
GOAL: A Challenging Knowledge-grounded Video Captioning Benchmark for Real-time Soccer Commentary Generation
Ji Qi
Jifan Yu
Teng Tu
Kunyu Gao
Yifan Xu
...
Juanzi Li
Jie Tang
Weidong Guo
Hui Liu
Yu-Syuan Xu
20
19
0
26 Mar 2023
Text with Knowledge Graph Augmented Transformer for Video Captioning
Text with Knowledge Graph Augmented Transformer for Video Captioning
Xin Gu
G. Chen
Yufei Wang
Libo Zhang
Tiejian Luo
Longyin Wen
11
47
0
22 Mar 2023
eP-ALM: Efficient Perceptual Augmentation of Language Models
eP-ALM: Efficient Perceptual Augmentation of Language Models
Mustafa Shukor
Corentin Dancette
Matthieu Cord
MLLM
VLM
24
29
0
20 Mar 2023
Implicit and Explicit Commonsense for Multi-sentence Video Captioning
Implicit and Explicit Commonsense for Multi-sentence Video Captioning
Shih-Han Chou
James J. Little
Leonid Sigal
15
2
0
14 Mar 2023
Selectively Hard Negative Mining for Alleviating Gradient Vanishing in
  Image-Text Matching
Selectively Hard Negative Mining for Alleviating Gradient Vanishing in Image-Text Matching
Zheng Li
Caili Guo
Xin Eric Wang
Zerun Feng
Zhongtian Du
VLM
10
4
0
01 Mar 2023
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense
  Video Captioning
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
Antoine Yang
Arsha Nagrani
Paul Hongsuck Seo
Antoine Miech
Jordi Pont-Tuset
Ivan Laptev
Josef Sivic
Cordelia Schmid
AI4TS
VLM
18
219
0
27 Feb 2023
Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey
Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey
Xiao Wang
Guangyao Chen
Guangwu Qian
Pengcheng Gao
Xiaoyong Wei
Yaowei Wang
Yonghong Tian
Wen Gao
AI4CE
VLM
24
195
0
20 Feb 2023
STOA-VLP: Spatial-Temporal Modeling of Object and Action for
  Video-Language Pre-training
STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training
Weihong Zhong
Mao Zheng
Duyu Tang
Xuan Luo
Heng Gong
Xiaocheng Feng
Bing Qin
22
8
0
20 Feb 2023
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image
  and Video
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video
Haiyang Xu
Qinghao Ye
Mingshi Yan
Yaya Shi
Jiabo Ye
...
Guohai Xu
Ji Zhang
Songfang Huang
Feiran Huang
Jingren Zhou
MLLM
VLM
MoE
23
158
0
01 Feb 2023
Semi-Parametric Video-Grounded Text Generation
Semi-Parametric Video-Grounded Text Generation
Sungdong Kim
Jin-Hwa Kim
Jiyoung Lee
Minjoon Seo
VGen
17
12
0
27 Jan 2023
Temporal Perceiving Video-Language Pre-training
Temporal Perceiving Video-Language Pre-training
Fan Ma
Xiaojie Jin
Heng Wang
Jingjia Huang
Linchao Zhu
Jiashi Feng
Yi Yang
VLM
13
15
0
18 Jan 2023
What You Say Is What You Show: Visual Narration Detection in
  Instructional Videos
What You Say Is What You Show: Visual Narration Detection in Instructional Videos
Kumar Ashutosh
Rohit Girdhar
Lorenzo Torresani
Kristen Grauman
11
4
0
05 Jan 2023
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training
Qinghao Ye
Guohai Xu
Ming Yan
Haiyang Xu
Qi Qian
Ji Zhang
Fei Huang
VLM
AI4TS
163
69
0
30 Dec 2022
METEOR Guided Divergence for Video Captioning
METEOR Guided Divergence for Video Captioning
D. Rothenpieler
Shahin Amiriparian
15
3
0
20 Dec 2022
VindLU: A Recipe for Effective Video-and-Language Pretraining
VindLU: A Recipe for Effective Video-and-Language Pretraining
Feng Cheng
Xizi Wang
Jie Lei
David J. Crandall
Mohit Bansal
Gedas Bertasius
VLM
27
78
0
09 Dec 2022
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive
  Captioners
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners
Shen Yan
Tao Zhu
Zirui Wang
Yuan Cao
Mi Zhang
Soham Ghosh
Yonghui Wu
Jiahui Yu
VLM
VGen
14
45
0
09 Dec 2022
Learning Video Representations from Large Language Models
Learning Video Representations from Large Language Models
Yue Zhao
Ishan Misra
Philipp Krahenbuhl
Rohit Girdhar
VLM
AI4TS
14
162
0
08 Dec 2022
Grafting Pre-trained Models for Multimodal Headline Generation
Grafting Pre-trained Models for Multimodal Headline Generation
Lingfeng Qiao
Chen Wu
Ye Liu
Haoyuan Peng
Di Yin
Bo Ren
30
5
0
14 Nov 2022
Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention
Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention
Xubo Liu
Qiushi Huang
Xinhao Mei
Haohe Liu
Qiuqiang Kong
...
Yu Zhang
Lilian H. Y. Tang
Mark D. Plumbley
Volkan Kilicc
Wenwu Wang
36
18
0
28 Oct 2022
Learning Fine-Grained Visual Understanding for Video Question Answering
  via Decoupling Spatial-Temporal Modeling
Learning Fine-Grained Visual Understanding for Video Question Answering via Decoupling Spatial-Temporal Modeling
Hsin-Ying Lee
Hung-Ting Su
Bing-Chen Tsai
Tsung-Han Wu
Jia-Fong Yeh
Winston H. Hsu
25
2
0
08 Oct 2022
Vision Transformer Based Model for Describing a Set of Images as a Story
Vision Transformer Based Model for Describing a Set of Images as a Story
Zainy M. Malakan
Ghulam Mubashar Hassan
Ajmal Saeed Mian
ViT
23
5
0
06 Oct 2022
An Empirical Study of End-to-End Video-Language Transformers with Masked
  Visual Modeling
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling
Tsu-jui Fu
Linjie Li
Zhe Gan
Kevin Qinghong Lin
William Yang Wang
Lijuan Wang
Zicheng Liu
VLM
13
63
0
04 Sep 2022
Zero-Shot Video Question Answering via Frozen Bidirectional Language
  Models
Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
Antoine Yang
Antoine Miech
Josef Sivic
Ivan Laptev
Cordelia Schmid
19
225
0
16 Jun 2022
LAVENDER: Unifying Video-Language Understanding as Masked Language
  Modeling
LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling
Linjie Li
Zhe Gan
Kevin Qinghong Lin
Chung-Ching Lin
Zicheng Liu
Ce Liu
Lijuan Wang
MLLM
VLM
18
81
0
14 Jun 2022
Previous
123
Next