VideoBERT: A Joint Model for Video and Language Representation Learning

3 April 2019

Carl Vondrick

Papers citing "VideoBERT: A Joint Model for Video and Language Representation Learning"

50 / 803 papers shown

SSAN: Separable Self-Attention Network for Video Representation Learning

Computer Vision and Pattern Recognition (CVPR), 2021

161

27 May 2021

Multi-modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training

IEEE Journal of Biomedical and Health Informatics (JBHI), 2021

Young-Hak Kim

226

211

24 May 2021

Pretrained Language Models for Text Generation: A Survey

261

206

21 May 2021

VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding

Findings of the Association for Computational Linguistics (Findings), 2021

Hu Xu

Gargi Ghosh

Po-Yao (Bernie) Huang

Prahal Arora

Masoumeh Aminzadeh

Christoph Feichtenhofer

Florian Metze

Luke Zettlemoyer

327

146

20 May 2021

NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions

Computer Vision and Pattern Recognition (CVPR), 2021

Junbin Xiao

Xindi Shang

Angela Yao

Tat-Seng Chua

390

721

18 May 2021

Episodic Transformer for Vision-and-Language Navigation

IEEE International Conference on Computer Vision (ICCV), 2021

345

212

13 May 2021

Designing Multimodal Datasets for NLP Challenges

201

12 May 2021

Breaking Shortcut: Exploring Fully Convolutional Cycle-Consistency for Video Correspondence Learning

237

12 May 2021

Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions

Computer Vision and Pattern Recognition (CVPR), 2021

179

10 May 2021

Recent Advances in Deep Learning Based Dialogue Systems: A Systematic Survey

Artificial Intelligence Review (AIR), 2021

827

322

10 May 2021

ISTR: End-to-End Instance Segmentation with Transformers

Liujuan Cao

170

03 May 2021

MathBERT: A Pre-Trained Model for Mathematical Formula Understanding

Shuai Peng

Ke Yuan

Liangcai Gao

Zhi Tang

AIMat

220

119

02 May 2021

Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos

IEEE International Conference on Computer Vision (ICCV), 2021

...

429

26 Apr 2021

MusCaps: Generating Captions for Music Audio

IEEE International Joint Conference on Neural Network (IJCNN), 2021

281

24 Apr 2021

Playing Lottery Tickets with Vision and Language

AAAI Conference on Artificial Intelligence (AAAI), 2021

Zicheng Liu

303

23 Apr 2021

Skeletor: Skeletal Transformers for Robust Body-Pose Estimation

238

23 Apr 2021

T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval

Computer Vision and Pattern Recognition (CVPR), 2021

Xiaohan Wang

Linchao Zhu

Yi Yang

376

210

20 Apr 2021

Detector-Free Weakly Supervised Grounding by Separation

IEEE International Conference on Computer Vision (ICCV), 2021

...

182

20 Apr 2021

Temporal Query Networks for Fine-grained Video Understanding

Computer Vision and Pattern Recognition (CVPR), 2021

Chuhan Zhang

Ankush Gupta

Andrew Zisserman

254

19 Apr 2021

Understanding Chinese Video and Language via Contrastive Multimodal Pre-Training

ACM Multimedia (ACM MM), 2021

163

19 Apr 2021

Multi-Modal Fusion Transformer for End-to-End Autonomous Driving

Computer Vision and Pattern Recognition (CVPR), 2021

274

636

19 Apr 2021

AMMU: A Survey of Transformer-based Biomedical Pretrained Language Models

Journal of Biomedical Informatics (JBI), 2021

Katikapalli Subramanyam Kalyan

A. Rajasekharan

S. Sangeetha

LM&MA MedIm

389

191

16 Apr 2021

Self-supervised object detection from audio-visual correspondence

Computer Vision and Pattern Recognition (CVPR), 2021

Triantafyllos Afouras

Yuki M. Asano

Francois Fagan

Andrea Vedaldi

Florian Metze

SSL

322

13 Apr 2021

FIBER: Fill-in-the-Blanks as a Challenging Video Understanding Evaluation Framework

Annual Meeting of the Association for Computational Linguistics (ACL), 2021

281

09 Apr 2021

Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning

Computer Vision and Pattern Recognition (CVPR), 2021

424

303

07 Apr 2021

Compressing Visual-linguistic Model via Knowledge Distillation

IEEE International Conference on Computer Vision (ICCV), 2021

Zhiyuan Fang

Jianfeng Wang

Xiaowei Hu

Lijuan Wang

Yezhou Yang

Zicheng Liu

VLM

280

116

05 Apr 2021

Self-supervised Video Representation Learning by Context and Motion Decoupling

Computer Vision and Pattern Recognition (CVPR), 2021

223

02 Apr 2021

CUPID: Adaptive Curation of Pre-training Data for Video-and-Language Representation Learning

Lei Zhang

196

01 Apr 2021

Diagnosing Vision-and-Language Navigation: What Really Matters

North American Chapter of the Association for Computational Linguistics (NAACL), 2021

Qi Wu

233

30 Mar 2021

Broaden Your Views for Self-Supervised Video Learning

IEEE International Conference on Computer Vision (ICCV), 2021

Adrià Recasens

Pauline Luc

Jean-Baptiste Alayrac

...

293

138

30 Mar 2021

Kaleido-BERT: Vision-Language Pre-training on Fashion Domain

Computer Vision and Pattern Recognition (CVPR), 2021

344

134

30 Mar 2021

HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval

IEEE International Conference on Computer Vision (ICCV), 2021

336

165

28 Mar 2021

A Comprehensive Review of the Video-to-Text Problem

Artificial Intelligence Review (AIR), 2021

267

27 Mar 2021

Understanding Robustness of Transformers for Image Classification

IEEE International Conference on Computer Vision (ICCV), 2021

Srinadh Bhojanapalli

309

468

26 Mar 2021

VLGrammar: Grounded Grammar Induction of Vision and Language

IEEE International Conference on Computer Vision (ICCV), 2021

174

24 Mar 2021

DeepViT: Towards Deeper Vision Transformer

Linjie Yang

338

601

22 Mar 2021

Incorporating Convolution Designs into Visual Transformers

IEEE International Conference on Computer Vision (ICCV), 2021

Ziwei Liu

297

566

22 Mar 2021

Let Your Heart Speak in its Mother Tongue: Multilingual Captioning of Cardiac Signals

Dani Kiyasseh

T. Zhu

David Clifton

238

19 Mar 2021

MDMMT: Multidomain Multimodal Transformer for Video Retrieval

Maksim Dzabraev

M. Kalashnikov

Stepan Alekseevich Komkov

Aleksandr Petiushko

221

148

19 Mar 2021

ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases

International Conference on Machine Learning (ICML), 2021

432

953

19 Mar 2021

Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning

IEEE International Conference on Computer Vision (ICCV), 2021

Joao Henriques

Andrea Vedaldi

AI4TS

271

18 Mar 2021

Unified Pre-training for Program Understanding and Generation

North American Chapter of the Association for Computational Linguistics (NAACL), 2021

417

851

10 Mar 2021

Involution: Inverting the Inherence of Convolution for Visual Recognition

Computer Vision and Pattern Recognition (CVPR), 2021

Xiangtai Li

Lei Zhu

Tong Zhang

Qifeng Chen

BDL

224

358

10 Mar 2021

Variable-rate discrete representation learning

209

10 Mar 2021

VideoMoCo: Contrastive Video Representation Learning with Temporally Adversarial Examples

Computer Vision and Pattern Recognition (CVPR), 2021

Wei Liu

244

252

10 Mar 2021

Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with Language and Vision

International Journal of Computer Vision (IJCV), 2021

Andrew Shin

Masato Ishii

T. Narihira

289

06 Mar 2021

A Straightforward Framework For Video Retrieval Using CLIP

Mexican Conference on Pattern Recognition (MPR), 2021

Jesús Andrés Portillo-Quintero

J. C. Ortíz-Bayliss

Hugo Terashima-Marín

CLIP

719

134

24 Feb 2021

LambdaNetworks: Modeling Long-Range Interactions Without Attention

International Conference on Learning Representations (ICLR), 2021

Irwan Bello

506

187

17 Feb 2021

Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling

Computer Vision and Pattern Recognition (CVPR), 2021

457

748

11 Feb 2021

Is Space-Time Attention All You Need for Video Understanding?

International Conference on Machine Learning (ICML), 2021

Gedas Bertasius

Heng Wang

Lorenzo Torresani

ViT

1.1K

2,648

09 Feb 2021