
VideoBERT: A Joint Model for Video and Language Representation Learning

3 April 2019

Carl Vondrick

Papers citing "VideoBERT: A Joint Model for Video and Language Representation Learning"

50 / 803 papers shown

ERNIE-GeoL: A Geography-and-Language Pre-trained Model and its Applications in Baidu Maps

Knowledge Discovery and Data Mining (KDD), 2022

212

17 Mar 2022

Object discovery and representation networks

European Conference on Computer Vision (ECCV), 2022

425

16 Mar 2022

Geographic Adaptation of Pretrained Language Models

Transactions of the Association for Computational Linguistics (TACL), 2022

391

16 Mar 2022

Modular and Parameter-Efficient Multimodal Fusion with Prompting

Findings (Findings), 2022

Sheng Liang

Mengjie Zhao

Hinrich Schütze

166

15 Mar 2022

Revitalize Region Feature for Democratizing Video-Language Pre-training of Retrieval

Rui Yan

...

Ying Shan

194

15 Mar 2022

All in One: Exploring Unified Video-Language Pre-training

Computer Vision and Pattern Recognition (CVPR), 2022

Rui Yan

Ying Shan

316

237

14 Mar 2022

Towards Visual-Prompt Temporal Answering Grounding in Medical Instructional Video

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022

717

13 Mar 2022

Cross-modal Map Learning for Vision and Language Navigation

Computer Vision and Pattern Recognition (CVPR), 2022

390

10 Mar 2022

CaSS: A Channel-aware Self-supervised Representation Learning Framework for Multivariate Time Series Classification

International Conference on Database Systems for Advanced Applications (DASFAA), 2022

166

08 Mar 2022

Vision-Language Intelligence: Tasks, Representation Learning, and Large Models

Lei Zhang

212

03 Mar 2022

High-Modality Multimodal Transformer: Quantifying Modality & Interaction Heterogeneity for High-Modality Representation Learning

Shentong Mo

Louis-Philippe Morency

Ruslan Salakhutdinov

230

02 Mar 2022

SGL: Symbolic Goal Learning in a Hybrid, Modular Framework for Human Instruction Following

IEEE Robotics and Automation Letters (RA-L), 2022

Ruinian Xu

Hongyi Chen

Yunzhi Lin

Patricio A. Vela

171

25 Feb 2022

ISDA: Position-Aware Instance Segmentation with Deformable Attention

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2022

228

23 Feb 2022

Movies2Scenes: Using Movie Metadata to Learn Scene Representation

Computer Vision and Pattern Recognition (CVPR), 2022

228

22 Feb 2022

Multi-view and Multi-modal Event Detection Utilizing Transformer-based Multi-sensor Fusion

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2022

155

18 Feb 2022

AMS_ADRN at SemEval-2022 Task 5: A Suitable Image-text Multimodal Joint Modeling Method for Multi-task Misogyny Identification

International Workshop on Semantic Evaluation (SemEval), 2022

Da Li

Ming Yi

Yukai He

144

18 Feb 2022

VLP: A Survey on Vision-Language Pre-training

Machine Intelligence Research (MIR), 2022

Minglun Han

396

289

18 Feb 2022

When Did It Happen? Duration-informed Temporal Localization of Narrated Actions in Vlogs

217

16 Feb 2022

Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations

369

347

16 Feb 2022

CommerceMM: Large-Scale Commerce MultiModal Representation Learning with Omni Retrieval

Knowledge Discovery and Data Mining (KDD), 2022

263

15 Feb 2022

UserBERT: Modeling Long- and Short-Term User Preferences via Self-Supervision

14 Feb 2022

Learning To Recognize Procedural Activities with Distant Supervision

Computer Vision and Pattern Recognition (CVPR), 2022

Gedas Bertasius

260

26 Jan 2022

MGA-VQA: Multi-Granularity Alignment for Visual Question Answering

Peixi Xiong

Yilin Shen

Hongxia Jin

108

25 Jan 2022

Text and Code Embeddings by Contrastive Pre-Training

...

610

538

24 Jan 2022

End-to-end Generative Pretraining for Multimodal Video Captioning

Computer Vision and Pattern Recognition (CVPR), 2022

300

185

20 Jan 2022

Video Transformers: A Survey

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022

458

139

16 Jan 2022

Boundary-aware Self-supervised Learning for Video Scene Segmentation

Asian Conference on Computer Vision (ACCV), 2022

Joonseok Lee

161

14 Jan 2022

Pretrained Language Models for Text Generation: A Survey

ACM Computing Surveys (CSUR), 2022

525

268

14 Jan 2022

Bridging Video-text Retrieval with Multiple Choice Questions

Computer Vision and Pattern Recognition (CVPR), 2022

Ying Shan

Ping Luo

296

121

13 Jan 2022

Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training

Yingwei Pan

Tao Mei

222

11 Jan 2022

On the Efficacy of Co-Attention Transformer Layers in Visual Question Answering

Ankur Sikarwar

Gabriel Kreiman

ViT

109

11 Jan 2022

Multi-Query Video Retrieval

European Conference on Computer Vision (ECCV), 2022

291

10 Jan 2022

MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound

Computer Vision and Pattern Recognition (CVPR), 2022

Yejin Choi

514

239

07 Jan 2022

Progressive Video Summarization via Multimodal Self-supervised Learning

IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2022

336

07 Jan 2022

Discrete and continuous representations and processing in deep learning: Looking forward

AI Open (AO), 2022

301

04 Jan 2022

InverseMV: Composing Piano Scores with a Convolutional Video-Music Transformer

Chin-Tung Lin

Mu Yang

ViT

174

31 Dec 2021

Synchronized Audio-Visual Frames with Fractional Positional Encoding for Transformers in Video-to-Text Translation

International Conference on Information Photonics (ICIP), 2021

160

28 Dec 2021

A Survey of Natural Language Generation

ACM Computing Surveys (CSUR), 2021

Min Yang

336

22 Dec 2021

Exploiting Long-Term Dependencies for Generating Dynamic Scene Graphs

IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2021

278

18 Dec 2021

Align and Prompt: Video-and-Language Pre-training with Entity Prompts

Computer Vision and Pattern Recognition (CVPR), 2021

362

214

17 Dec 2021

Contrastive Vision-Language Pre-training with Limited Resources

158

17 Dec 2021

CoCo-BERT: Improving Video-Language Pre-training with Contrastive Cross-modal Matching and Denoising

Yingwei Pan

Tao Mei

161

14 Dec 2021

Short and Long Range Relation Based Spatio-Temporal Transformer for Micro-Expression Recognition

311

10 Dec 2021

Exploring Temporal Granularity in Self-Supervised Video Representation Learning

Ming-Hsuan Yang

200

08 Dec 2021

Suppressing Static Visual Cues via Normalizing Flows for Self-Supervised Video Representation Learning

Manlin Zhang

Jinpeng Wang

A. J. Ma

173

07 Dec 2021

Joint Learning of Localized Representations from Medical Images and Reports

European Conference on Computer Vision (ECCV), 2021

Philipp Muller

Georgios Kaissis

Cong Zou

Daniel Rueckert

440

113

06 Dec 2021

Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks

251

152

02 Dec 2021

Video-Text Pre-training with Learned Regions

Rui Yan

261

02 Dec 2021

Routing with Self-Attention for Multimodal Capsule Networks

138

01 Dec 2021

Object-aware Video-language Pre-training for Retrieval

Rui Yan

Ying Shan

286

01 Dec 2021