ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2011.07231
  4. Cited By
ActBERT: Learning Global-Local Video-Text Representations

ActBERT: Learning Global-Local Video-Text Representations

14 November 2020
Linchao Zhu
Yi Yang
    ViT
ArXivPDFHTML

Papers citing "ActBERT: Learning Global-Local Video-Text Representations"

50 / 269 papers shown
Title
Beyond One-to-One: Rethinking the Referring Image Segmentation
Beyond One-to-One: Rethinking the Referring Image Segmentation
Yutao Hu
Qixiong Wang
Wenqi Shao
Enze Xie
Zhenguo Li
Jungong Han
Ping Luo
3DV
14
37
0
26 Aug 2023
Preserving Modality Structure Improves Multi-Modal Learning
Preserving Modality Structure Improves Multi-Modal Learning
Swetha Sirnam
Mamshad Nayeem Rizve
Nina Shvetsova
Hilde Kuehne
M. Shah
23
4
0
24 Aug 2023
UnLoc: A Unified Framework for Video Localization Tasks
UnLoc: A Unified Framework for Video Localization Tasks
Shengjia Yan
Xuehan Xiong
Arsha Nagrani
Anurag Arnab
Zhonghao Wang
Weina Ge
David A. Ross
Cordelia Schmid
19
53
0
21 Aug 2023
Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer
Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer
Guangyi Chen
Xiao Liu
Guangrun Wang
Kun Zhang
Philip H.S.Torr
Xiaoping Zhang
Yansong Tang
19
18
0
16 Aug 2023
MovieChat: From Dense Token to Sparse Memory for Long Video
  Understanding
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
Enxin Song
Wenhao Chai
Guanhong Wang
Yucheng Zhang
Haoyang Zhou
...
Tianbo Ye
Yanting Zhang
Yang Lu
Jenq-Neng Hwang
Gaoang Wang
VLM
MLLM
22
260
0
31 Jul 2023
Keyword-Aware Relative Spatio-Temporal Graph Networks for Video Question
  Answering
Keyword-Aware Relative Spatio-Temporal Graph Networks for Video Question Answering
Yi Cheng
Hehe Fan
Dongyun Lin
Ying Sun
Mohan S. Kankanhalli
J. Lim
32
4
0
25 Jul 2023
Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature
  Alignment
Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment
Sarah Ibrahimi
Xiaohang Sun
Pichao Wang
Amanmeet Garg
Ashutosh Sanan
Mohamed Omar
44
14
0
24 Jul 2023
Video-Mined Task Graphs for Keystep Recognition in Instructional Videos
Video-Mined Task Graphs for Keystep Recognition in Instructional Videos
Kumar Ashutosh
Santhosh Kumar Ramakrishnan
Triantafyllos Afouras
Kristen Grauman
21
23
0
17 Jul 2023
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding
  and Generation
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
Yi Wang
Yinan He
Yizhuo Li
Kunchang Li
Jiashuo Yu
...
Ping Luo
Ziwei Liu
Yali Wang
Limin Wang
Yu Qiao
VLM
VGen
25
244
0
13 Jul 2023
EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the
  Backbone
EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone
Shraman Pramanick
Yale Song
Sayan Nag
Kevin Qinghong Lin
Hardik Shah
Mike Zheng Shou
Ramalingam Chellappa
Pengchuan Zhang
VLM
34
86
0
11 Jul 2023
One-Versus-Others Attention: Scalable Multimodal Integration for
  Clinical Data
One-Versus-Others Attention: Scalable Multimodal Integration for Clinical Data
Michal Golovanevsky
Eva Schiller
Akira Nair
Ritambhara Singh
Carsten Eickhoff
16
2
0
11 Jul 2023
MPM: A Unified 2D-3D Human Pose Representation via Masked Pose Modeling
MPM: A Unified 2D-3D Human Pose Representation via Masked Pose Modeling
Zhenyu Zhang
Wenhao Chai
Zhongyu Jiang
Tianbo Ye
Mingli Song
Jenq-Neng Hwang
Gaoang Wang
3DH
26
4
0
29 Jun 2023
Exploring the Role of Audio in Video Captioning
Exploring the Role of Audio in Video Captioning
Yuhan Shen
Linjie Yang
Longyin Wen
Haichao Yu
Ehsan Elhamifar
Heng Wang
18
2
0
21 Jun 2023
Learning to Ground Instructional Articles in Videos through Narrations
Learning to Ground Instructional Articles in Videos through Narrations
E. Mavroudi
Triantafyllos Afouras
Lorenzo Torresani
DiffM
25
21
0
06 Jun 2023
MoviePuzzle: Visual Narrative Reasoning through Multimodal Order
  Learning
MoviePuzzle: Visual Narrative Reasoning through Multimodal Order Learning
Jianghui Wang
Yuxuan Wang
Dongyan Zhao
Zilong Zheng
37
1
0
04 Jun 2023
Recent Advances of Local Mechanisms in Computer Vision: A Survey and
  Outlook of Recent Work
Recent Advances of Local Mechanisms in Computer Vision: A Survey and Outlook of Recent Work
Qiangchang Wang
Yilong Yin
23
0
0
02 Jun 2023
VLAB: Enhancing Video Language Pre-training by Feature Adapting and
  Blending
VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending
Xingjian He
Sihan Chen
Fan Ma
Zhicheng Huang
Xiaojie Jin
Zikang Liu
Dongmei Fu
Yi Yang
J. Liu
Jiashi Feng
VLM
CLIP
18
17
0
22 May 2023
Is a Video worth $n\times n$ Images? A Highly Efficient Approach to
  Transformer-based Video Question Answering
Is a Video worth n×nn\times nn×n Images? A Highly Efficient Approach to Transformer-based Video Question Answering
Chenyang Lyu
Tianbo Ji
Yvette Graham
Jennifer Foster
ViT
19
0
0
16 May 2023
Mask to reconstruct: Cooperative Semantics Completion for Video-text
  Retrieval
Mask to reconstruct: Cooperative Semantics Completion for Video-text Retrieval
Han Fang
Zhifei Yang
Xianghao Zang
Chao Ban
Hao Sun
VGen
24
2
0
13 May 2023
VideoChat: Chat-Centric Video Understanding
VideoChat: Chat-Centric Video Understanding
Kunchang Li
Yinan He
Yi Wang
Yizhuo Li
Wen Wang
Ping Luo
Yali Wang
Limin Wang
Yu Qiao
MLLM
35
527
0
10 May 2023
A Dual Semantic-Aware Recurrent Global-Adaptive Network For
  Vision-and-Language Navigation
A Dual Semantic-Aware Recurrent Global-Adaptive Network For Vision-and-Language Navigation
Liuyi Wang
Zongtao He
Jiagui Tang
Ronghao Dang
Naijia Wang
Chengju Liu
Qi Chen
22
17
0
05 May 2023
A Large Cross-Modal Video Retrieval Dataset with Reading Comprehension
A Large Cross-Modal Video Retrieval Dataset with Reading Comprehension
Weijia Wu
Yuzhong Zhao
Zhuangzi Li
Jiahong Li
Hong Zhou
Mike Zheng Shou
Xiang Bai
23
19
0
05 May 2023
VideoOFA: Two-Stage Pre-Training for Video-to-Text Generation
VideoOFA: Two-Stage Pre-Training for Video-to-Text Generation
Xilun Chen
L. Yu
Wenhan Xiong
Barlas Ouguz
Yashar Mehdad
Wen-tau Yih
VGen
26
3
0
04 May 2023
Making the Most of What You Have: Adapting Pre-trained Visual Language
  Models in the Low-data Regime
Making the Most of What You Have: Adapting Pre-trained Visual Language Models in the Low-data Regime
Chuhan Zhang
Antoine Miech
Jiajun Shen
Jean-Baptiste Alayrac
Pauline Luc
VLM
VPVLM
39
2
0
03 May 2023
SViTT: Temporal Learning of Sparse Video-Text Transformers
SViTT: Temporal Learning of Sparse Video-Text Transformers
Yi Li
Kyle Min
Subarna Tripathi
Nuno Vasconcelos
17
12
0
18 Apr 2023
Efficient Multimodal Fusion via Interactive Prompting
Efficient Multimodal Fusion via Interactive Prompting
Yaowei Li
Ruijie Quan
Linchao Zhu
Yezhou Yang
28
43
0
13 Apr 2023
Interaction-Aware Prompting for Zero-Shot Spatio-Temporal Action
  Detection
Interaction-Aware Prompting for Zero-Shot Spatio-Temporal Action Detection
Wei-Jhe Huang
Jheng-Hsien Yeh
Min-Hung Chen
Gueter Josmy Faure
S. Lai
28
3
0
10 Apr 2023
DATE: Domain Adaptive Product Seeker for E-commerce
DATE: Domain Adaptive Product Seeker for E-commerce
Haoyuan Li
Haojie Jiang
Tao Jin
Meng-Juan Li
Yan Chen
Zhijie Lin
Yang Zhao
Zhou Zhao
26
6
0
07 Apr 2023
Scalable and Accurate Self-supervised Multimodal Representation Learning
  without Aligned Video and Text Data
Scalable and Accurate Self-supervised Multimodal Representation Learning without Aligned Video and Text Data
Vladislav Lialin
Stephen Rawls
David M. Chan
Shalini Ghosh
Anna Rumshisky
Wael Hamza
VLM
AI4TS
28
6
0
04 Apr 2023
Self-Supervised Multimodal Learning: A Survey
Self-Supervised Multimodal Learning: A Survey
Yongshuo Zong
Oisin Mac Aodha
Timothy M. Hospedales
SSL
19
43
0
31 Mar 2023
Learning Procedure-aware Video Representation from Instructional Videos
  and Their Narrations
Learning Procedure-aware Video Representation from Instructional Videos and Their Narrations
Yiwu Zhong
Licheng Yu
Yang Bai
Shangwen Li
Xueting Yan
Yin Li
AI4TS
30
31
0
31 Mar 2023
Structured Video-Language Modeling with Temporal Grouping and Spatial
  Grounding
Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding
Yuanhao Xiong
Long Zhao
Boqing Gong
Ming-Hsuan Yang
Florian Schroff
Ting Liu
Cho-Jui Hsieh
Liangzhe Yuan
VLM
19
0
0
28 Mar 2023
MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation
  Models
MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models
Dohwan Ko
Joon-Young Choi
Hyeong Kyu Choi
Kyoung-Woon On
Byungseok Roh
Hyunwoo J. Kim
44
19
0
23 Mar 2023
Text with Knowledge Graph Augmented Transformer for Video Captioning
Text with Knowledge Graph Augmented Transformer for Video Captioning
Xin Gu
G. Chen
Yufei Wang
Libo Zhang
Tiejian Luo
Longyin Wen
19
47
0
22 Mar 2023
Weakly Supervised Video Representation Learning with Unaligned Text for
  Sequential Videos
Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos
Sixun Dong
Huazhang Hu
Dongze Lian
Weixin Luo
Yichen Qian
Shenghua Gao
ViT
AI4TS
21
11
0
22 Mar 2023
Transformers in Speech Processing: A Survey
Transformers in Speech Processing: A Survey
S. Latif
Aun Zaidi
Heriberto Cuayáhuitl
Fahad Shamshad
Moazzam Shoukat
Junaid Qadir
40
46
0
21 Mar 2023
Accommodating Audio Modality in CLIP for Multimodal Processing
Accommodating Audio Modality in CLIP for Multimodal Processing
Ludan Ruan
Anwen Hu
Yuqing Song
Liang Zhang
S. Zheng
Qin Jin
VLM
16
10
0
12 Mar 2023
Learning Grounded Vision-Language Representation for Versatile
  Understanding in Untrimmed Videos
Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos
Teng Wang
Jinrui Zhang
Feng Zheng
Wenhao Jiang
Ran Cheng
Ping Luo
VLM
26
11
0
11 Mar 2023
MuLTI: Efficient Video-and-Language Understanding with Text-Guided
  MultiWay-Sampler and Multiple Choice Modeling
MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling
Jiaqi Xu
Bo Liu
Yunkuo Chen
Mengli Cheng
Xing Shi
43
1
0
10 Mar 2023
DeCap: Decoding CLIP Latents for Zero-Shot Captioning via Text-Only
  Training
DeCap: Decoding CLIP Latents for Zero-Shot Captioning via Text-Only Training
Wei Li
Linchao Zhu
Longyin Wen
Yi Yang
VLM
40
86
0
06 Mar 2023
Contrastive Video Question Answering via Video Graph Transformer
Contrastive Video Question Answering via Video Graph Transformer
Junbin Xiao
Pan Zhou
Angela Yao
Yicong Li
Richang Hong
Shuicheng Yan
Tat-Seng Chua
ViT
19
35
0
27 Feb 2023
Deep Learning for Video-Text Retrieval: a Review
Deep Learning for Video-Text Retrieval: a Review
Cunjuan Zhu
Qi Jia
Wei-Neng Chen
Yanming Guo
Yu Liu
24
14
0
24 Feb 2023
Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey
Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey
Xiao Wang
Guangyao Chen
Guangwu Qian
Pengcheng Gao
Xiaoyong Wei
Yaowei Wang
Yonghong Tian
Wen Gao
AI4CE
VLM
24
199
0
20 Feb 2023
STOA-VLP: Spatial-Temporal Modeling of Object and Action for
  Video-Language Pre-training
STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training
Weihong Zhong
Mao Zheng
Duyu Tang
Xuan Luo
Heng Gong
Xiaocheng Feng
Bing Qin
25
8
0
20 Feb 2023
Video-Text Retrieval by Supervised Sparse Multi-Grained Learning
Video-Text Retrieval by Supervised Sparse Multi-Grained Learning
Yimu Wang
Peng Shi
8
5
0
19 Feb 2023
BEST: BERT Pre-Training for Sign Language Recognition with Coupling
  Tokenization
BEST: BERT Pre-Training for Sign Language Recognition with Coupling Tokenization
Weichao Zhao
Hezhen Hu
Wen-gang Zhou
Jiaxin Shi
Houqiang Li
SLR
21
30
0
10 Feb 2023
Tagging before Alignment: Integrating Multi-Modal Tags for Video-Text
  Retrieval
Tagging before Alignment: Integrating Multi-Modal Tags for Video-Text Retrieval
Yizhen Chen
Jie Wang
Lijian Lin
Zhongang Qi
Jin Ma
Ying Shan
VLM
16
18
0
30 Jan 2023
Temporal Perceiving Video-Language Pre-training
Temporal Perceiving Video-Language Pre-training
Fan Ma
Xiaojie Jin
Heng Wang
Jingjia Huang
Linchao Zhu
Jiashi Feng
Yi Yang
VLM
13
15
0
18 Jan 2023
What You Say Is What You Show: Visual Narration Detection in
  Instructional Videos
What You Say Is What You Show: Visual Narration Detection in Instructional Videos
Kumar Ashutosh
Rohit Girdhar
Lorenzo Torresani
Kristen Grauman
11
4
0
05 Jan 2023
Learning Trajectory-Word Alignments for Video-Language Tasks
Learning Trajectory-Word Alignments for Video-Language Tasks
Xu Yang
Zhang Li
Haiyang Xu
Hanwang Zhang
Qinghao Ye
Chenliang Li
Ming Yan
Yu Zhang
Fei Huang
Songfang Huang
28
7
0
05 Jan 2023
Previous
123456
Next