VideoBERT: A Joint Model for Video and Language Representation Learning

3 April 2019

Carl Vondrick

Papers citing "VideoBERT: A Joint Model for Video and Language Representation Learning"

50 / 803 papers shown

RA-Rec: An Efficient ID Representation Alignment Framework for LLM-based Recommendation

161

07 Feb 2024

M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval

Ming Yang

254

31 Jan 2024

Towards Urban General Intelligence: A Review and Outlook of Urban Foundation Models

554

30 Jan 2024

Cross-Modal Coordination Across a Diverse Set of Input Modalities

Jorge Sánchez

Rodrigo Laguna

VLM

238

29 Jan 2024

Dynamic Transformer Architecture for Continual Learning of Multimodal Tasks

Yuliang Cai

Mohammad Rostami

357

27 Jan 2024

Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities

Computer Vision and Pattern Recognition (CVPR), 2024

Ying Shan

312

25 Jan 2024

CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model for Multimodal Processing

IEEE Transactions on Audio, Speech, and Language Processing (IEEE TASLP), 2024

Xianghu Yue

Haizhou Li

235

22 Jan 2024

DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval

Yi Yang

296

19 Jan 2024

Collaboratively Self-supervised Video Representation Learning for Action Recognition

IEEE Transactions on Information Forensics and Security (IEEE TIFS), 2024

376

15 Jan 2024

Video Understanding with Large Language Models: A Survey

...

712

163

29 Dec 2023

Data-Efficient Multimodal Fusion on a Single GPU

Computer Vision and Pattern Recognition (CVPR), 2023

460

15 Dec 2023

Audio-Visual LLM for Video Understanding

Lei Zhang

240

11 Dec 2023

A Video is Worth 10,000 Words: Training and Benchmarking with Diverse Captions for Better Long Video Retrieval

IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2023

288

30 Nov 2023

Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding in Novel Domains

496

30 Nov 2023

E-ViLM: Efficient Video-Language Model via Masked Video Modeling with Semantic Vector-Quantized Tokenizer

392

28 Nov 2023

Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding

Computer Vision and Pattern Recognition (CVPR), 2023

Xin Li

311

442

28 Nov 2023

Vamos: Versatile Action Models for Video Understanding

European Conference on Computer Vision (ECCV), 2023

Shijie Wang

389

22 Nov 2023

SPOT! Revisiting Video-Language Models for Event Understanding

Jindong Gu

443

21 Nov 2023

Advancing Drug Discovery with Enhanced Chemical Understanding via Asymmetric Contrastive Multimodal Learning

Journal of Chemical Information and Modeling (JCIM), 2023

371

11 Nov 2023

Towards A Unified Neural Architecture for Visual Recognition and Reasoning

163

10 Nov 2023

CLearViD: Curriculum Learning for Video Description

Cheng-Yu Chuang

Pooyan Fazli

152

08 Nov 2023

A Single 2D Pose with Context is Worth Hundreds for 3D Human Pose Estimation

Neural Information Processing Systems (NeurIPS), 2023

231

06 Nov 2023

ACQUIRED: A Dataset for Answering Counterfactual Questions In Real-Life Videos

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

Nischal Reddy Chandra

Marjorie Freedman

R. Weischedel

Nanyun Peng

282

02 Nov 2023

Object-centric Video Representation for Long-term Action Anticipation

IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2023

Shijie Wang

279

31 Oct 2023

Harvest Video Foundation Models via Efficient Post-Pretraining

Yu Qiao

Ping Luo

CLIP VLM VGen

350

30 Oct 2023

Generating Context-Aware Natural Answers for Questions in 3D Scenes

British Machine Vision Conference (BMVC), 2023

Mohammed Munzer Dwedari

Matthias Niessner

Dave Zhenyu Chen

194

30 Oct 2023

MOSEL: Inference Serving Using Dynamic Modality Selection

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

300

27 Oct 2023

ArchBERT: Bi-Modal Understanding of Neural Architectures and Natural Languages

Conference on Computational Natural Language Learning (CoNLL), 2023

Mohammad Akbari

Saeed Ranjbar Alvar

Behnam Kamranian

Amin Banitalebi-Dehkordi

Yong Zhang

AI4CE

139

26 Oct 2023

Exploring Iterative Refinement with Diffusion Models for Video Grounding

IEEE International Conference on Multimedia and Expo (ICME), 2023

267

26 Oct 2023

CAD -- Contextual Multi-modal Alignment for Dynamic AVQA

IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2023

302

25 Oct 2023

FloCoDe: Unbiased Dynamic Scene Graph Generation with Temporal Consistency and Correlation Debiasing

Anant Khandelwal

459

24 Oct 2023

UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023

389

17 Oct 2023

GePSAn: Generative Procedure Step Anticipation in Cooking Videos

IEEE International Conference on Computer Vision (ICCV), 2023

M. A. Abdelsalam

Samrudhdhi B. Rangrej

Isma Hadji

Nikita Dvornik

Konstantinos G. Derpanis

Afsaneh Fazly

AI4TS

217

12 Oct 2023

Latent Wander: an Alternative Interface for Interactive and Serendipitous Discovery of Large AV Archives

Yuchen Yang

Linyida Zhang

219

09 Oct 2023

HowToCaption: Prompting LLMs to Transform Video Annotations at Scale

European Conference on Computer Vision (ECCV), 2023

Nina Shvetsova

Anna Kukleva

Xudong Hong

Christian Rupprecht

Bernt Schiele

Hilde Kuehne

297

07 Oct 2023

CLEVRER-Humans: Describing Physical and Causal Events the Human Way

Neural Information Processing Systems (NeurIPS), 2023

333

05 Oct 2023

GRID: A Platform for General Robot Intelligence Development

271

02 Oct 2023

Skip-Plan: Procedure Planning in Instructional Videos via Condensed Action Space Learning

IEEE International Conference on Computer Vision (ICCV), 2023

Lei Chen

Jie Zhou

178

01 Oct 2023

PROSE: Predicting Operators and Symbolic Expressions using Multimodal Transformers

Yuxuan Liu

Zecheng Zhang

Hayden Schaeffer

211

28 Sep 2023

Social Media Fashion Knowledge Extraction as Captioning

180

28 Sep 2023

ELIP: Efficient Language-Image Pre-training with Fewer Vision Tokens

Haoyu Zhang

238

28 Sep 2023

Zero-Shot and Few-Shot Video Question Answering with Multi-Modal Prompts

Bipin Rajendran

Bashir M. Al-Hashimi

MLLM VLM

253

27 Sep 2023

VidChapters-7M: Video Chapters at Scale

Neural Information Processing Systems (NeurIPS), 2023

246

25 Sep 2023

Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-form Video Understanding

Sernam Lim

195

20 Sep 2023

Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning

ACM Multimedia (ACM MM), 2023

...

366

20 Sep 2023

Collaborative Three-Stream Transformers for Video Captioning

Computer Vision and Image Understanding (CVIU), 2023

193

18 Sep 2023

Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning

IEEE International Conference on Computer Vision (ICCV), 2023

216

14 Sep 2023

Unified Pre-training with Pseudo Texts for Text-To-Image Person Re-identification

IEEE International Conference on Computer Vision (ICCV), 2023

Jingdong Wang

239

04 Sep 2023

COMEDIAN: Self-Supervised Learning and Knowledge Distillation for Action Spotting using Transformers

289

03 Sep 2023

A Fine-Grained Image Description Generation Method Based on Joint Objectives

Chinese Conference on Computer Supported Cooperative Work and Social Computing (SCWSC), 2023

123

02 Sep 2023