ActBERT: Learning Global-Local Video-Text Representations

14 November 2020
Linchao Zhu
Yi Yang
    ViT

Papers citing "ActBERT: Learning Global-Local Video-Text Representations"

50 / 269 papers shown
Title
LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling
Linjie Li
Zhe Gan
Kevin Qinghong Lin
Chung-Ching Lin
Zicheng Liu
Ce Liu
Lijuan Wang
MLLM
VLM
18
81
0
14 Jun 2022
Multimodal Learning with Transformers: A Survey
P. Xu
Xiatian Zhu
David A. Clifton
ViT
41
522
0
13 Jun 2022
Revealing Single Frame Bias for Video-and-Language Learning
Jie Lei
Tamara L. Berg
Mohit Bansal
24
109
0
07 Jun 2022
Revisiting the "Video" in Video-Language Understanding
S. Buch
Cristobal Eyzaguirre
Adrien Gaidon
Jiajun Wu
L. Fei-Fei
Juan Carlos Niebles
25
155
0
03 Jun 2022
Egocentric Video-Language Pretraining
Kevin Qinghong Lin
Alex Jinpeng Wang
Mattia Soldan
Michael Wray
Rui Yan
...
Hongfa Wang
Dima Damen
Bernard Ghanem
Wei Liu
Mike Zheng Shou
VLM
EgoV
29
188
0
03 Jun 2022
GIT: A Generative Image-to-text Transformer for Vision and Language
Jianfeng Wang
Zhengyuan Yang
Xiaowei Hu
Linjie Li
Kevin Qinghong Lin
Zhe Gan
Zicheng Liu
Ce Liu
Lijuan Wang
VLM
27
526
0
27 May 2022
Future Transformer for Long-term Action Anticipation
Dayoung Gong
Joonseok Lee
Manjin Kim
S. Ha
Minsu Cho
AI4TS
8
61
0
27 May 2022
Learning to Answer Visual Questions from Web Videos
Antoine Yang
Antoine Miech
Josef Sivic
Ivan Laptev
Cordelia Schmid
ViT
28
33
0
10 May 2022
CenterCLIP: Token Clustering for Efficient Text-Video Retrieval
Shuai Zhao
Linchao Zhu
Xiaohan Wang
Yi Yang
VLM
CLIP
20
112
0
02 May 2022
Flamingo: a Visual Language Model for Few-Shot Learning
Jean-Baptiste Alayrac
Jeff Donahue
Pauline Luc
Antoine Miech
Iain Barr
...
Mikolaj Binkowski
Ricardo Barreira
Oriol Vinyals
Andrew Zisserman
Karen Simonyan
MLLM
VLM
46
3,326
0
29 Apr 2022
MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval
Yuying Ge
Yixiao Ge
Xihui Liu
Alex Jinpeng Wang
Jianping Wu
Ying Shan
Xiaohu Qie
Ping Luo
VLM
9
43
0
26 Apr 2022
Contrastive Language-Action Pre-training for Temporal Localization
Mengmeng Xu
Erhan Gundogdu
Maksim
Bernard Ghanem
M. Donoser
Loris Bazzani
22
27
0
26 Apr 2022
A Multi-level Alignment Training Scheme for Video-and-Language Grounding
Yubo Zhang
Feiyang Niu
Q. Ping
Govind Thattai
CVBM
33
2
0
22 Apr 2022
COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval
Haoyu Lu
Nanyi Fei
Yuqi Huo
Yizhao Gao
Zhiwu Lu
Jiaxin Wen
CLIP
VLM
19
54
0
15 Apr 2022
Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations
Jie Jiang
Shaobo Min
Weijie Kong
Dihong Gong
Hongfa Wang
Zhifeng Li
Wei Liu
VLM
18
18
0
07 Apr 2022
Temporal Alignment Networks for Long-term Video
Tengda Han
Weidi Xie
Andrew Zisserman
AI4TS
20
82
0
06 Apr 2022
ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound
Yan-Bo Lin
Jie Lei
Mohit Bansal
Gedas Bertasius
31
39
0
06 Apr 2022
GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval
Yuxuan Wang
Difei Gao
Licheng Yu
Stan Weixian Lei
Matt Feiszli
Mike Zheng Shou
9
24
0
01 Apr 2022
Video-Text Representation Learning via Differentiable Weak Temporal Alignment
Dohwan Ko
Joonmyung Choi
Juyeon Ko
Shinyeong Noh
Kyoung-Woon On
Eun-Sol Kim
Hyunwoo J. Kim
VGen
AI4TS
12
22
0
31 Mar 2022
CREATE: A Benchmark for Chinese Short Video Retrieval and Title Generation
Ziqi Zhang
Yuxin Chen
Zongyang Ma
Zhongang Qi
Chunfen Yuan
Bing Li
Ying Shan
Weiming Hu
VGen
19
8
0
31 Mar 2022
TubeDETR: Spatio-Temporal Video Grounding with Transformers
Antoine Yang
Antoine Miech
Josef Sivic
Ivan Laptev
Cordelia Schmid
ViT
28
94
0
30 Mar 2022
End-to-End Transformer Based Model for Image Captioning
Yiyu Wang
Jungang Xu
Yingfei Sun
VLM
ViT
18
117
0
29 Mar 2022
X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval
S. Gorti
Noël Vouitsis
Junwei Ma
Keyvan Golestan
M. Volkovs
Animesh Garg
Guangwei Yu
25
148
0
28 Mar 2022
Locality-Aware Inter-and Intra-Video Reconstruction for Self-Supervised Correspondence Learning
Liulei Li
Tianfei Zhou
Wenguan Wang
Lu Yang
Jian-Wei Li
Yi Yang
SSL
17
47
0
27 Mar 2022
Audio-Adaptive Activity Recognition Across Video Domains
Yun C. Zhang
Hazel Doughty
Ling Shao
Cees G. M. Snoek
15
38
0
27 Mar 2022
Bilaterally Slimmable Transformer for Elastic and Efficient Visual Question Answering
Zhou Yu
Zitian Jin
Jun Yu
Mingliang Xu
Hongbo Wang
Jianping Fan
25
4
0
24 Mar 2022
How Do You Do It? Fine-Grained Action Understanding with Pseudo-Adverbs
Hazel Doughty
Cees G. M. Snoek
20
19
0
23 Mar 2022
Local-Global Context Aware Transformer for Language-Guided Video Segmentation
Chen Liang
Wenguan Wang
Tianfei Zhou
Jiaxu Miao
Yawei Luo
Yi Yang
VOS
22
74
0
18 Mar 2022
Revitalize Region Feature for Democratizing Video-Language Pre-training of Retrieval
Guanyu Cai
Yixiao Ge
Binjie Zhang
Alex Jinpeng Wang
Rui Yan
...
Ying Shan
Lianghua He
Xiaohu Qie
Jianping Wu
Mike Zheng Shou
VLM
10
6
0
15 Mar 2022
All in One: Exploring Unified Video-Language Pre-training
Alex Jinpeng Wang
Yixiao Ge
Rui Yan
Yuying Ge
Xudong Lin
Guanyu Cai
Jianping Wu
Ying Shan
Xiaohu Qie
Mike Zheng Shou
14
200
0
14 Mar 2022
Video Question Answering: Datasets, Algorithms and Challenges
Yaoyao Zhong
Junbin Xiao
Wei Ji
Yicong Li
Wei Deng
Tat-Seng Chua
16
84
0
02 Mar 2022
VLP: A Survey on Vision-Language Pre-training
Feilong Chen
Duzhen Zhang
Minglun Han
Xiuyi Chen
Jing Shi
Shuang Xu
Bo Xu
VLM
82
212
0
18 Feb 2022
Multi-View Fusion Transformer for Sensor-Based Human Activity Recognition
Yimu Wang
Kun Yu
Yan Wang
Huiwen Xue
HAI
8
2
0
16 Feb 2022
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Junnan Li
Dongxu Li
Caiming Xiong
S. Hoi
MLLM
BDL
VLM
CLIP
390
4,124
0
28 Jan 2022
Learning To Recognize Procedural Activities with Distant Supervision
Xudong Lin
Fabio Petroni
Gedas Bertasius
Marcus Rohrbach
Shih-Fu Chang
Lorenzo Torresani
22
82
0
26 Jan 2022
Reading-strategy Inspired Visual Representation Learning for Text-to-Video Retrieval
Jianfeng Dong
Yabing Wang
Xianke Chen
Xiaoye Qu
Xirong Li
Y. He
Xun Wang
6
58
0
23 Jan 2022
A Pre-trained Audio-Visual Transformer for Emotion Recognition
Minh Tran
M. Soleymani
58
25
0
23 Jan 2022
End-to-end Generative Pretraining for Multimodal Video Captioning
Paul Hongsuck Seo
Arsha Nagrani
Anurag Arnab
Cordelia Schmid
27
164
0
20 Jan 2022
Temporal Sentence Grounding in Videos: A Survey and Future Directions
Hao Zhang
Aixin Sun
Wei Jing
Joey Tianyi Zhou
3DGS
34
38
0
20 Jan 2022
Video Transformers: A Survey
Javier Selva
A. S. Johansen
Sergio Escalera
Kamal Nasrollahi
T. Moeslund
Albert Clapés
ViT
20
102
0
16 Jan 2022
Bridging Video-text Retrieval with Multiple Choice Questions
Yuying Ge
Yixiao Ge
Xihui Liu
Dian Li
Ying Shan
Xiaohu Qie
Ping Luo
BDL
16
108
0
13 Jan 2022
Multi-Query Video Retrieval
Zeyu Wang
Yu Wu
Karthik Narasimhan
Olga Russakovsky
36
17
0
10 Jan 2022
Progressive Video Summarization via Multimodal Self-supervised Learning
Haopeng Li
Qiuhong Ke
Mingming Gong
Tom Drummond
AI4TS
31
18
0
07 Jan 2022
A Survey of Natural Language Generation
Chenhe Dong
Yinghui Li
Haifan Gong
M. Chen
Junxin Li
Ying Shen
Min Yang
3DV
19
43
0
22 Dec 2021
Align and Prompt: Video-and-Language Pre-training with Entity Prompts
Dongxu Li
Junnan Li
Hongdong Li
Juan Carlos Niebles
S. Hoi
20
191
0
17 Dec 2021
CoCo-BERT: Improving Video-Language Pre-training with Contrastive Cross-modal Matching and Denoising
Jianjie Luo
Yehao Li
Yingwei Pan
Ting Yao
Hongyang Chao
Tao Mei
VLM
16
41
0
14 Dec 2021
Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval
Nina Shvetsova
Brian Chen
Andrew Rouditchenko
Samuel Thomas
Brian Kingsbury
Rogerio Feris
David F. Harwath
James R. Glass
Hilde Kuehne
ViT
23
129
0
08 Dec 2021
Video-Text Pre-training with Learned Regions
Rui Yan
Mike Zheng Shou
Yixiao Ge
Alex Jinpeng Wang
Xudong Lin
Guanyu Cai
Jinhui Tang
25
23
0
02 Dec 2021
Routing with Self-Attention for Multimodal Capsule Networks
Kevin Duarte
Brian Chen
Nina Shvetsova
Andrew Rouditchenko
Samuel Thomas
Alexander H. Liu
David F. Harwath
James R. Glass
Hilde Kuehne
M. Shah
SSL
29
5
0
01 Dec 2021
Object-aware Video-language Pre-training for Retrieval
Alex Jinpeng Wang
Yixiao Ge
Guanyu Cai
Rui Yan
Xudong Lin
Ying Shan
Xiaohu Qie
Mike Zheng Shou
ViT
VLM
17
79
0
01 Dec 2021