
VideoBERT: A Joint Model for Video and Language Representation Learning

3 April 2019

Carl Vondrick

Papers citing "VideoBERT: A Joint Model for Video and Language Representation Learning"

50 / 803 papers shown

NER-BERT: A Pre-trained Model for Low-Resource Entity Tagging

304

01 Dec 2021

CLIP Meets Video Captioning: Concept-Aware Representation Learning Does Matter

Bang-ju Yang

Tong Zhang

Yuexian Zou

CLIP

143

30 Nov 2021

ContIG: Self-supervised Multimodal Contrastive Learning for Medical Imaging with Genetics

Computer Vision and Pattern Recognition (CVPR), 2021

609

26 Nov 2021

SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning

Computer Vision and Pattern Recognition (CVPR), 2021

Zicheng Liu

351

303

25 Nov 2021

VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling

Zicheng Liu

405

240

24 Nov 2021

Hierarchical Modular Network for Video Captioning

Hanhua Ye

Guorong Li

Yuankai Qi

Shuhui Wang

Qingming Huang

Ming-Hsuan Yang

230

24 Nov 2021

Scaling Up Vision-Language Pre-training for Image Captioning

Xiaowei Hu

Zicheng Liu

423

300

24 Nov 2021

Multi-Person 3D Motion Prediction with Multi-Range Transformers

252

23 Nov 2021

Towards Tokenized Human Dynamics Representation

219

22 Nov 2021

Class-agnostic Object Detection with Multi-modal Transformer

European Conference on Computer Vision (ECCV), 2021

Salman Khan

Rao Muhammad Anwer

623

116

22 Nov 2021

Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions

Computer Vision and Pattern Recognition (CVPR), 2021

253

19 Nov 2021

DVCFlow: Modeling Information Flow Towards Human-like Video Captioning

Zhengcong Fei

260

19 Nov 2021

A Survey of Visual Transformers

IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2021

Yang Liu

473

487

11 Nov 2021

CLIP2TV: Align, Match and Distill for Video-Text Retrieval

132

10 Nov 2021

Machine Learning for Multimodal Electronic Health Records-based Research: Challenges and Perspectives

297

09 Nov 2021

NarrationBot and InfoBot: A Hybrid System for Automated Video Description

110

07 Nov 2021

Benchmarking Multimodal AutoML for Tabular Data with Text Fields

155

04 Nov 2021

Revisiting spatio-temporal layouts for compositional action recognition

British Machine Vision Conference (BMVC), 2021

Gorjan Radevski

Marie-Francine Moens

Tinne Tuytelaars

212

02 Nov 2021

Masking Modalities for Cross-modal Video Retrieval

IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2021

298

01 Nov 2021

With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition

British Machine Vision Conference (BMVC), 2021

Dima Damen

297

01 Nov 2021

Cross-Modality Fusion Transformer for Multispectral Object Detection

Social Science Research Network (SSRN), 2021

301

269

30 Oct 2021

MEmoBERT: Pre-training Model with Prompt-based Learning for Multimodal Emotion Recognition

Haizhou Li

145

27 Oct 2021

Multimodal Learning using Optimal Transport for Sarcasm and Humor Detection

IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2021

Shraman Pramanick

A. Roy

Vishal M. Patel

205

21 Oct 2021

Toward Accurate and Reliable Iris Segmentation Using Uncertainty Learning

165

20 Oct 2021

Energon: Towards Efficient Acceleration of Transformers Using Dynamic Sparse Attention

268

18 Oct 2021

Understanding Multimodal Procedural Knowledge by Sequencing Multimodal Instructional Manuals

Te-Lin Wu

Alexander Spangher

Pegah Alipoormolabashi

Marjorie Freedman

R. Weischedel

Nanyun Peng

278

16 Oct 2021

Semantically Distributed Robust Optimization for Vision-and-Language Inference

Yezhou Yang

326

14 Oct 2021

A CLIP-Enhanced Method for Video-Language Understanding

127

14 Oct 2021

Multi-Modal Pre-Training for Automated Speech Recognition

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2021

234

12 Oct 2021

Vit-GAN: Image-to-image Translation with Vision Transformes and Conditional GANS

Yigit Gündüç

ViT

100

11 Oct 2021

SignBERT: Pre-Training of Hand-Model-Aware Representation for Sign Language Recognition

IEEE International Conference on Computer Vision (ICCV), 2021

263

109

11 Oct 2021

Pretrained Language Models are Symbolic Mathematics Solvers too!

292

07 Oct 2021

Attention is All You Need? Good Embeddings with Statistics are enough: Large Scale Audio Understanding without Transformers/ Convolutions/ BERTs/ Mixers/ Attention/ RNNs or ....

Prateek Verma

AI4TS

281

07 Oct 2021

Tensor-to-Image: Image-to-Image Translation with Vision Transformers

Y. Gündüç

ViT

06 Oct 2021

ProTo: Program-Guided Transformer for Program-Guided Tasks

260

02 Oct 2021

CrossCLR: Cross-modal Contrastive Learning For Multi-modal Video Representations

Mohammadreza Zolfaghari

Yi Zhu

Peter V. Gehler

Thomas Brox

332

148

30 Sep 2021

IntentVizor: Towards Generic Query Guided Interactive Video Summarization

Guande Wu

Jianzhe Lin

Claudio T. Silva

230

30 Sep 2021

VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding

Hu Xu

Gargi Ghosh

Po-Yao (Bernie) Huang

Florian Metze

Luke Zettlemoyer

Christoph Feichtenhofer

CLIP VLM

830

694

28 Sep 2021

Audio-to-Image Cross-Modal Generation

IEEE International Joint Conference on Neural Network (IJCNN), 2021

Maciej Żelaszczyk

Jacek Mańdziuk

DiffM

202

27 Sep 2021

Self-Supervised Video Representation Learning by Video Incoherence Detection

IEEE Transactions on Cybernetics (IEEE Trans. Cybern.), 2021

Yuecong Xu

Lihua Xie

121

26 Sep 2021

CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models

Yuan Yao

Ao Zhang

Zhengyan Zhang

Zhiyuan Liu

Tat-Seng Chua

Maosong Sun

MLLM VPVLM VLM

589

244

24 Sep 2021

Pairwise Emotional Relationship Recognition in Drama Videos: Dataset and Benchmark

ACM Multimedia (ACM MM), 2021

136

23 Sep 2021

Does Vision-and-Language Pretraining Improve Lexical Grounding?

236

21 Sep 2021

Survey: Transformer based Video-Language Pre-training

Ludan Ruan

Qin Jin

VLM ViT

210

21 Sep 2021

Overview of Tencent Multi-modal Ads Video Understanding Challenge

147

16 Sep 2021

Cross-lingual Transfer of Monolingual Models

256

15 Sep 2021

Can Language Models Encode Perceptual Structure Without Grounding? A Case Study in Color

Daniel Hershcovich

215

159

13 Sep 2021

A Survey on Multi-modal Summarization

206

11 Sep 2021

PlaTe: Visually-Grounded Planning with Transformers in Procedural Tasks

IEEE Robotics and Automation Letters (RA-L), 2021

De-An Huang

180

10 Sep 2021

M5Product: Self-harmonized Contrastive Learning for E-commercial Multi-modal Pretraining

Computer Vision and Pattern Recognition (CVPR), 2021

Michael C. Kampffmeyer

Xiaoyong Wei

Minlong Lu

Yaowei Wang

Xiaodan Liang

586

09 Sep 2021