VideoBERT: A Joint Model for Video and Language Representation Learning

3 April 2019

Carl Vondrick

Papers citing "VideoBERT: A Joint Model for Video and Language Representation Learning"

50 / 803 papers shown

MGeo: Multi-Modal Geographic Pre-Training Method

Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2023

Fei Huang

273

11 Jan 2023

Universal Multimodal Representation for Language Understanding

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023

Rui Wang

291

09 Jan 2023

MAQA: A Multimodal QA Benchmark for Negation

Dima Kuzmin

216

09 Jan 2023

Logically at Factify 2: A Multi-Modal Fact Checking System Based on Evidence Retrieval techniques and Transformer Encoder Architecture

269

09 Jan 2023

Test of Time: Instilling Video-Language Models with a Sense of Time

Computer Vision and Pattern Recognition (CVPR), 2023

Piyush Bagad

Makarand Tapaswi

Cees G. M. Snoek

465

05 Jan 2023

Learning Trajectory-Word Alignments for Video-Language Tasks

IEEE International Conference on Computer Vision (ICCV), 2023

Fei Huang

215

05 Jan 2023

NaQ: Leveraging Narrations as Queries to Supervise Episodic Memory

Computer Vision and Pattern Recognition (CVPR), 2023

Santhosh Kumar Ramakrishnan

Ziad Al-Halah

Kristen Grauman

372

02 Jan 2023

HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training

IEEE International Conference on Computer Vision (ICCV), 2022

Ji Zhang

Fei Huang

VLM AI4TS

547

30 Dec 2022

Emotion Recognition with Pre-Trained Transformers Using Multimodal Signals

Affective Computing and Intelligent Interaction (ACII), 2022

Juan Vazquez-Rodriguez

G. Lefebvre

Julien Cumin

James L. Crowley

195

22 Dec 2022

VindLU: A Recipe for Effective Video-and-Language Pretraining

Computer Vision and Pattern Recognition (CVPR), 2022

Gedas Bertasius

290

09 Dec 2022

Tencent AVS: A Holistic Ads Video Dataset for Multi-modal Scene Segmentation

IEEE Access (IEEE Access), 2022

Wei Liu

199

09 Dec 2022

Learning Video Representations from Large Language Models

Computer Vision and Pattern Recognition (CVPR), 2022

307

231

08 Dec 2022

SimVTP: Simple Video Text Pre-training with Masked Autoencoders

Yue Ma

Tianyu Yang

Yin Shan

Xiu Li

169

07 Dec 2022

InternVideo: General Video Foundation Models via Generative and Discriminative Learning

Yi Wang

...

Yu Qiao

466

448

06 Dec 2022

Muscles in Action

IEEE International Conference on Computer Vision (ICCV), 2022

Mia Chiquier

Carl Vondrick

319

05 Dec 2022

Day2Dark: Pseudo-Supervised Activity Recognition beyond Silent Daylight

International Journal of Computer Vision (IJCV), 2022

303

05 Dec 2022

Masked Contrastive Pre-Training for Efficient Video-Text Retrieval

190

02 Dec 2022

Protein Language Models and Structure Prediction: Connection and Progression

Cheng Tan

Stan Z. Li

220

30 Nov 2022

SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation

International Conference on Machine Learning (ICML), 2022

Tianrui Li

248

196

27 Nov 2022

Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention

IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2022

179

21 Nov 2022

Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations

Neural Information Processing Systems (NeurIPS), 2022

307

21 Nov 2022

Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information

Computer Vision and Pattern Recognition (CVPR), 2022

Weijie Su

Gao Huang

Yu Qiao

Xiaogang Wang

Jie Zhou

Jifeng Dai

245

17 Nov 2022

Language Conditioned Spatial Relation Reasoning for 3D Object Grounding

Neural Information Processing Systems (NeurIPS), 2022

266

128

17 Nov 2022

Cross-Modal Adapter for Vision-Language Retrieval

Pattern Recognition (Pattern Recogn.), 2022

368

17 Nov 2022

Pragmatics in Language Grounding: Phenomena, Tasks, and Modeling Approaches

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022

Daniel Fried

250

15 Nov 2022

Grafting Pre-trained Models for Multimodal Headline Generation

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022

Di Yin

248

14 Nov 2022

CLOP: Video-and-Language Pre-Training with Knowledge Regularizations

ACM Multimedia (ACM MM), 2022

178

07 Nov 2022

CASA: Category-agnostic Skeletal Animal Reconstruction

Neural Information Processing Systems (NeurIPS), 2022

Shenlong Wang

261

04 Nov 2022

Scaling Multimodal Pre-Training via Cross-Modality Gradient Harmonization

Neural Information Processing Systems (NeurIPS), 2022

153

03 Nov 2022

DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention

ACM Transactions on Knowledge Discovery from Data (TKDD), 2021

Xuancheng Ren

Yuexian Zou

209

28 Oct 2022

End-to-End Multimodal Representation Learning for Video Dialog

213

26 Oct 2022

Fine-grained Semantic Alignment Network for Weakly Supervised Temporal Language Grounding

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022

155

21 Oct 2022

LiteVL: Efficient Video-Language Learning with Enhanced Spatial-Temporal Modeling

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022

Dongsheng Chen

Chaofan Tao

Lu Hou

Lifeng Shang

Xin Jiang

Qun Liu

VLM

251

21 Oct 2022

H4VDM: H.264 Video Device Matching

116

20 Oct 2022

Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text Generation

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022

193

20 Oct 2022

Grounded Video Situation Recognition

Neural Information Processing Systems (NeurIPS), 2022

Zeeshan Khan

C. V. Jawahar

Makarand Tapaswi

192

19 Oct 2022

VTC: Improving Video-Text Retrieval with User Comments

European Conference on Computer Vision (ECCV), 2022

Christian Rupprecht

245

19 Oct 2022

Video in 10 Bits: Few-Bit VideoQA for Efficiency and Privacy

202

15 Oct 2022

Plausible May Not Be Faithful: Probing Object Hallucination in Vision-Language Pre-training

Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2022

313

14 Oct 2022

Can Language Representation Models Think in Bets?

Royal Society Open Science (RSOS), 2022

Zhi-Bin Tang

Mayank Kejriwal

159

14 Oct 2022

RaP: Redundancy-aware Video-language Pre-training for Text-Video Retrieval

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022

162

13 Oct 2022

Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning

Neural Information Processing Systems (NeurIPS), 2022

297

12 Oct 2022

Contrastive Video-Language Learning with Fine-grained Frame Sampling

Yujie Zhong

233

10 Oct 2022

Generating Executable Action Plans with Environmentally-Aware Language Models

IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2022

Maitrey Gramopadhye

D. Szafir

LM&Ro LLMAG

325

10 Oct 2022

ConTra: (Con)text (Tra)nsformer for Cross-Modal Video Retrieval

Asian Conference on Computer Vision (ACCV), 2022

A. Fragomeni

Michael Wray

Dima Damen

CLIP ViT

158

09 Oct 2022

Learning Fine-Grained Visual Understanding for Video Question Answering via Decoupling Spatial-Temporal Modeling

British Machine Vision Conference (BMVC), 2022

Hsin-Ying Lee

Hung-Ting Su

312

08 Oct 2022

See, Plan, Predict: Language-guided Cognitive Planning with Video Prediction

213

07 Oct 2022

Visualize Before You Write: Imagination-Guided Open-Ended Text Generation

Findings (Findings), 2022

324

07 Oct 2022

Understanding Prior Bias and Choice Paralysis in Transformer-based Language Representation Models through Four Experimental Probes

Ke Shen

Mayank Kejriwal

192

03 Oct 2022

CLIP2Point: Transfer CLIP to Point Cloud Classification with Image-Depth Pre-training

IEEE International Conference on Computer Vision (ICCV), 2022

Bowen Dong

Xiaoshui Huang

Wanli Ouyang

580

199

03 Oct 2022