
VideoBERT: A Joint Model for Video and Language Representation Learning

3 April 2019

Carl Vondrick

Papers citing "VideoBERT: A Joint Model for Video and Language Representation Learning"

50 / 803 papers shown

Nearest Neighbor Future Captioning: Generating Descriptions for Possible Collisions in Object Placement Tasks

243

18 Jul 2024

Missing Modality Prediction for Unpaired Multimodal Learning via Joint Embedding of Unimodal Models

Donggeun Kim

Taesup Kim

265

17 Jul 2024

Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning

See-Kiong Ng

Luu Anh Tuan

479

04 Jul 2024

MIREncoder: Multi-modal IR-based Pretrained Embeddings for Performance Optimizations

Akash Dutta

Ali Jannesari

235

02 Jul 2024

Enhancing Video-Language Representations with Structural Spatio-Temporal Alignment

Hao Fei

Meishan Zhang

277

27 Jun 2024

Multimodal Large Language Models with Fusion Low Rank Adaptation for Device Directed Speech Detection

Interspeech (Interspeech), 2024

Jack Berkowitz

Ahmed Hussen Abdelaziz

Saurabh N. Adya

Ahmed H. Tewfik

VLM

177

13 Jun 2024

ProTrain: Efficient LLM Training via Memory-Aware Techniques

234

12 Jun 2024

Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning

Chenyu Yang

Xizhou Zhu

Jinguo Zhu

Weijie Su

Junjie Wang

...

Lewei Lu

Bin Li

Jie Zhou

Yu Qiao

Jifeng Dai

VLM CLIP

200

11 Jun 2024

AutoTVG: A New Vision-language Pre-training Paradigm for Temporal Video Grounding

Yu-Gang Jiang

267

11 Jun 2024

Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

587

09 Jun 2024

Seeing the Unseen: Visual Metaphor Captioning for Videos

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

Abisek Rajakumar Kalarani

Pushpak Bhattacharyya

Sumit Shekhar

VLM

164

07 Jun 2024

A Survey of Language-Based Communication in Robotics

William Hunt

Sarvapali D. Ramchurn

Mohammad D. Soorati

LM&Ro

711

06 Jun 2024

MuJo: Multimodal Joint Feature Space Learning for Human Activity Recognition

Maximilian Kiefer-Emmanouilidis

Paul Lukowicz

HAI

469

06 Jun 2024

FILS: Self-Supervised Video Feature Prediction In Semantic Language Space

Mona Ahmadian

Frank Guerin

Andrew Gilbert

333

05 Jun 2024

GeminiFusion: Efficient Pixel-wise Multimodal Fusion for Vision Transformer

512

03 Jun 2024

WIDIn: Wording Image for Domain-Invariant Representation in Single-Source Domain Generalization

172

28 May 2024

A Survey on Vision-Language-Action Models for Embodied AI

893

169

23 May 2024

From CNNs to Transformers in Multimodal Human Action Recognition: A Survey

Muhammad Bilal Shaikh

Syed Mohammed Shamsul Islam

Douglas Chai

Naveed Akhtar

347

22 May 2024

A Tale of Two Languages: Large-Vocabulary Continuous Sign Language Recognition from Spoken Language Supervision

326

16 May 2024

PriorCLIP: Visual Prior Guided Vision-Language Model for Remote Sensing Image-Text Retrieval

258

16 May 2024

Unified Video-Language Pre-training with Synchronized Audio

Shentong Mo

Haofan Wang

Huaxia Li

Xu Tang

270

12 May 2024

Learning Object States from Actions via Large Language Models

Masatoshi Tateno

Takuma Yagi

Ryosuke Furuta

Yoichi Sato

136

02 May 2024

Mamba-360: Survey of State Space Models as Transformer Alternative for Long Sequence Modelling: Methods, Applications, and Challenges

Badri N. Patro

Vijay Srinivas Agneeswaran

Mamba

362

24 Apr 2024

SHE-Net: Syntax-Hierarchy-Enhanced Text-Video Retrieval

Ming Yang

404

22 Apr 2024

Towards a Foundation Model for Partial Differential Equations: Multi-Operator Learning and Extrapolation

Jingmin Sun

Yuxuan Liu

Zecheng Zhang

Hayden Schaeffer

AI4CE

406

18 Apr 2024

Multi-Task Multi-Modal Self-Supervised Learning for Facial Expression Recognition

Marah Halawa

Florian Blume

Pia Bideau

Martin Maier

Rasha Abdel Rahman

Olaf Hellwich

CVBM

230

16 Apr 2024

Guided Masked Self-Distillation Modeling for Distributed Multimedia Sensor Event Analysis

273

12 Apr 2024

MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

Ser-Nam Lim

360

181

08 Apr 2024

Vision Transformers in Domain Adaptation and Generalization: A Study of Robustness

314

05 Apr 2024

Learning Correlation Structures for Vision Transformers

298

05 Apr 2024

SUGAR: Pre-training 3D Visual Representations for Robotics

Computer Vision and Pattern Recognition (CVPR), 2024

258

01 Apr 2024

FSMR: A Feature Swapping Multi-modal Reasoning Approach with Joint Textual and Visual Clues

Shuang Li

Jiahua Wang

Lijie Wen

LRM

151

29 Mar 2024

Enhancing Efficiency in Vision Transformer Networks: Design Techniques and Insights

...

Ehsan Khodapanah Aghdam

Amirhossein Kazerouni

Ilker Hacihaliloglu

Dorit Merhof

304

28 Mar 2024

Multi-Stage Multi-Modal Pre-Training for Automatic Speech Recognition

Venkatesh Ravichandran

Shalini Ghosh

254

28 Mar 2024

Dense Vision Transformer Compression with Few Samples

230

27 Mar 2024

Generative Multi-modal Models are Good Class-Incremental Learners

Ming-Ming Cheng

314

27 Mar 2024

InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding

European Conference on Computer Vision (ECCV), 2024

...

Yifei Huang

Yu Qiao

Yali Wang

Limin Wang

262

104

22 Mar 2024

Semantic-Enhanced Representation Learning for Road Networks with Temporal Dynamics

IEEE Transactions on Mobile Computing (IEEE TMC), 2024

195

18 Mar 2024

Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding

Yifei Huang

279

126

14 Mar 2024

DAM: Dynamic Adapter Merging for Continual Video QA Learning

IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2024

Feng Cheng

Ziyang Wang

Yi-Lin Sung

Yan-Bo Lin

Mohit Bansal

Gedas Bertasius

CLL MoMe

361

13 Mar 2024

VideoMamba: State Space Model for Efficient Video Understanding

European Conference on Computer Vision (ECCV), 2024

Yu Qiao

284

390

11 Mar 2024

Materials science in the era of large language models: a perspective

Digital Discovery (DD), 2024

Ge Lei

Ronan Docherty

Samuel J. Cooper

230

11 Mar 2024

On the Generalization Ability of Unsupervised Pretraining

International Conference on Artificial Intelligence and Statistics (AISTATS), 2024

223

11 Mar 2024

CAD-SIGNet: CAD Language Inference from Point Clouds using Layer-wise Sketch Instance Guided Attention

293

27 Feb 2024

Event-aware Video Corpus Moment Retrieval

250

21 Feb 2024

LLMs Meet Long Video: Advancing Long Video Comprehension with An Interactive Visual Adapter in LLMs

265

21 Feb 2024

Video ReCap: Recursive Captioning of Hour-Long Videos

Gedas Bertasius

670

20 Feb 2024

Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning

Hao Fei

370

100

18 Feb 2024

Revisiting Feature Prediction for Learning Visual Representations from Video

345

177

15 Feb 2024

Comment-aided Video-Language Alignment via Contrastive Pre-training for Short-form Video Humor Detection

263

14 Feb 2024