VideoBERT: A Joint Model for Video and Language Representation Learning

3 April 2019

Carl Vondrick

Papers citing "VideoBERT: A Joint Model for Video and Language Representation Learning"

50 / 803 papers shown

Learning grounded word meaning representations on similarity graphs

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021

Mariella Dimiccoli

H. Wendt

Pau Batlle

155

07 Sep 2021

Sensor-Augmented Egocentric-Video Captioning with Dynamic Modal Attention

ACM Multimedia (ACM MM), 2021

212

07 Sep 2021

Vision Guided Generative Pre-trained Language Models for Multimodal Abstractive Summarization

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021

297

06 Sep 2021

Audio-Visual Transformer Based Crowd Counting

237

04 Sep 2021

Zero-shot Natural Language Video Localization

IEEE International Conference on Computer Vision (ICCV), 2021

348

29 Aug 2021

Drop-DTW: Aligning Common Signal Between Sequences While Dropping Outliers

Neural Information Processing Systems (NeurIPS), 2021

Nikita Dvornik

Isma Hadji

Konstantinos G. Derpanis

Animesh Garg

Allan D. Jepson

162

26 Aug 2021

TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment

IEEE International Conference on Computer Vision (ICCV), 2021

Jianwei Yang

Yonatan Bisk

Jianfeng Gao

226

154

23 Aug 2021

Guiding Query Position and Performing Similar Attention for Transformer-Based Detection Heads

121

22 Aug 2021

Knowledge Perceived Multi-modal Pretraining in E-commerce

Ningyu Zhang

Huajun Chen

232

20 Aug 2021

Investigating transformers in the decomposition of polygonal shapes as point collections

183

17 Aug 2021

Who's Waldo? Linking People Across Text and Images

205

16 Aug 2021

Multiview Detection with Shadow Transformer (and View-Coherent Data Augmentation)

ACM Multimedia (ACM MM), 2021

Yunzhong Hou

Liang Zheng

ViT

176

12 Aug 2021

Video Transformer for Deepfake Detection with Incremental Learning

ACM Multimedia (ACM MM), 2021

Sohail Ahmed Khan

Hang Dai

ViT

209

11 Aug 2021

Vision Transformer with Progressive Sampling

Shuyang Sun

202

03 Aug 2021

Word2Pix: Word to Pixel Cross Attention Transformer in Visual Grounding

IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2021

194

31 Jul 2021

Multimodal Co-learning: Challenges, Applications with Datasets, Recent Advances and Future Directions

Information Fusion (Inf. Fusion), 2021

389

175

29 Jul 2021

Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing

ACM Computing Surveys (CSUR), 2021

Graham Neubig

775

4,857

28 Jul 2021

Predicting the Future from First Person (Egocentric) Vision: A Survey

Computer Vision and Image Understanding (CVIU), 2021

203

28 Jul 2021

Exceeding the Limits of Visual-Linguistic Multi-Task Learning

Cameron R. Wolfe

Keld T. Lundgaard

VLM

144

27 Jul 2021

LAORAM: A Look Ahead ORAM Architecture for Training Large Embedding Tables

International Symposium on Computer Architecture (ISCA), 2021

Rachit Rajat

Yongqin Wang

M. Annavaram

167

16 Jul 2021

BERT-like Pre-training for Symbolic Piano Music Classification Tasks

272

12 Jul 2021

Local-to-Global Self-Attention in Vision Transformers

121

10 Jul 2021

Learning Vision-Guided Quadrupedal Locomotion End-to-End with Cross-Modal Transformers

International Conference on Learning Representations (ICLR), 2021

306

132

08 Jul 2021

VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer

194

06 Jul 2021

Test-Time Personalization with a Transformer for Human Pose Estimation

302

05 Jul 2021

Cross-Modal Transformer-Based Neural Correction Models for Automatic Speech Recognition

114

04 Jul 2021

Attention Bottlenecks for Multimodal Fusion

Neural Information Processing Systems (NeurIPS), 2021

577

698

30 Jun 2021

A Generative Model for Raw Audio Using Transformer Architectures

International Conference on Digital Audio Effects (DAFx), 2021

Prateek Verma

C. Chafe

243

30 Jun 2021

iReason: Multimodal Commonsense Reasoning using Videos and Natural Language with Interpretability

Andrew Wang

Vasu Sharma

CML

137

25 Jun 2021

Towards Long-Form Video Understanding

Computer Vision and Pattern Recognition (CVPR), 2021

Chao-Yuan Wu

Philipp Krahenbuhl

VLM ViT

323

194

21 Jun 2021

End-to-end Temporal Action Detection with Transformer

IEEE Transactions on Image Processing (TIP), 2021

Xiaolong Liu

Qimeng Wang

Yao Hu

Xu Tang

306

292

18 Jun 2021

All You Can Embed: Natural Language based Vehicle Retrieval with Spatio-Temporal Transformers

134

18 Jun 2021

GEM: A General Evaluation Benchmark for Multimodal Tasks

Findings (Findings), 2021

204

18 Jun 2021

Pre-Trained Models: Past, Present and Future

AI Open (AO), 2021

Xu Han

Zhengyan Zhang

Ning Ding

Yuxian Gu

Xiao Liu

...

Jun Zhu

385

990

14 Jun 2021

Cross-Modal Attention Consistency for Video-Audio Unsupervised Learning

Shaobo Min

Chuang Gan

Jingdong Wang

150

13 Jun 2021

Transformed CNNs: recasting pre-trained convolutional layers with self-attention

10 Jun 2021

Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers

Neural Information Processing Systems (NeurIPS), 2021

Ishan Misra

Florian Metze

Christoph Feichtenhofer

Andrea Vedaldi

João F. Henriques

283

340

09 Jun 2021

Semi-Supervised 3D Hand-Object Poses Estimation with Interactions in Time

Computer Vision and Pattern Recognition (CVPR), 2021

Hanwen Jiang

231

192

09 Jun 2021

VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation

...

Zicheng Liu

265

117

08 Jun 2021

A Survey of Transformers

AI Open (AO), 2021

Tianyang Lin

Yuxin Wang

Xiangyang Liu

Xipeng Qiu

ViT

445

1,386

08 Jun 2021

Efficient Training of Visual Transformers with Small Datasets

Neural Information Processing Systems (NeurIPS), 2021

Wei Bi

180

213

07 Jun 2021

BERTGEN: Multi-task Generation through BERT

Annual Meeting of the Association for Computational Linguistics (ACL), 2021

Pranava Madhyastha

111

07 Jun 2021

Transformed ROIs for Capturing Visual Transformations in Videos

Computer Vision and Image Understanding (CVIU), 2021

Abhinav Rai

Fadime Sener

Angela Yao

ViT

230

06 Jun 2021

Transferring Knowledge from Text to Video: Zero-Shot Anticipation for Procedural Actions

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2021

Fadime Sener

Rishabh Saraf

Angela Yao

LM&Ro

183

06 Jun 2021

MERLOT: Multimodal Neural Script Knowledge Models

Neural Information Processing Systems (NeurIPS), 2021

Yejin Choi

348

428

04 Jun 2021

Anticipative Video Transformer

IEEE International Conference on Computer Vision (ICCV), 2021

Rohit Girdhar

Kristen Grauman

ViT

335

251

03 Jun 2021

TVDIM: Enhancing Image Self-Supervised Pretraining via Noisy Text Data

120

03 Jun 2021

Attention mechanisms and deep learning for machine vision: A survey of the state of the art

A. M. Hafiz

S. A. Parah

R. A. Bhat

227

03 Jun 2021

Connecting Language and Vision for Natural Language-Based Vehicle Retrieval

Shuai Bai

Chang Zhou

Yi Yang

Hongxia Yang

227

31 May 2021

Rethinking the constraints of multimodal fusion: case study in Weakly-Supervised Audio-Visual Video Parsing

223

30 May 2021