VideoChat: Chat-Centric Video Understanding

10 May 2023

Yi Wang

Ping Luo

Yu Qiao

Papers citing "VideoChat: Chat-Centric Video Understanding"

50 / 563 papers shown

DoLLM: How Large Language Models Understanding Network Flow Data to Detect Carpet Bombing DDoS

238

13 May 2024

How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs

Muhammad Uzair Khattak

Muhammad Ferjad Naeem

Jameel Hassan

Muzammal Naseer

Federico Tombari

Fahad Shahbaz Khan

Salman Khan

LRM ELM

291

06 May 2024

WorldQA: Multimodal World Knowledge in Videos through Long-Chain Reasoning

Christopher Arif Setiadharma

Jingkang Yang

Ziwei Liu

VGen

229

06 May 2024

Octopi: Object Property Reasoning with Large Tactile-Language Models

406

05 May 2024

Uncovering What, Why and How: A Comprehensive Benchmark for Causation Understanding of Video Anomaly

...

222

30 Apr 2024

MovieChat+: Question-aware Sparse Memory for Long Video Question Answering

Xi Li

256

26 Apr 2024

MER 2024: Semi-Supervised Learning, Noise Robustness, and Open-Vocabulary Multimodal Emotion Recognition

...

Björn W. Schuller

394

26 Apr 2024

PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning

See Kiong Ng

274

283

25 Apr 2024

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

...

Dahua Lin

Yu Qiao

Jifeng Dai

Wenhai Wang

MLLM VLM

534

1,004

25 Apr 2024

SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension

Ying Shan

211

106

25 Apr 2024

Pegasus-v1 Technical Report

...

105

23 Apr 2024

Graphic Design with Large Multimodal Model

330

22 Apr 2024

From Image to Video, what do we need in multimodal LLMs?

303

18 Apr 2024

HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision

Siddhant Bansal

Michael Wray

Dima Damen

219

15 Apr 2024

Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs

211

11 Apr 2024

MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

Ser-Nam Lim

362

184

08 Apr 2024

JRDB-Social: A Multifaceted Robotic Dataset for Understanding of Context and Dynamics of Human Interactions Within Social Groups

Simindokht Jahangard

Zhixi Cai

Shiki Wen

Hamid Rezatofighi

184

06 Apr 2024

Koala: Key frame-conditioned long video-LLM

371

05 Apr 2024

SemGrasp: Semantic Grasp Generation via Language Aligned Discretization

European Conference on Computer Vision (ECCV), 2024

259

04 Apr 2024

LongVLM: Efficient Long Video Understanding via Large Language Models

European Conference on Computer Vision (ECCV), 2024

Yuetian Weng

Mingfei Han

Haoyu He

Xiaojun Chang

Bohan Zhuang

VLM

372

128

04 Apr 2024

Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward

...

Alexander G. Hauptmann

Yonatan Bisk

Yiming Yang

MLLM

383

124

01 Apr 2024

ST-LLM: Large Language Models Are Effective Temporal Learners

Ying Shan

210

125

30 Mar 2024

LITA: Language Instructed Temporal-Localization Assistant

De-An Huang

Shijia Liao

Subhashree Radhakrishnan

241

104

27 Mar 2024

An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM

240

27 Mar 2024

Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning

Han Xiao

370

217

25 Mar 2024

Elysium: Exploring Object-level Perception in Videos via MLLM

327

25 Mar 2024

Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding

AAAI Conference on Artificial Intelligence (AAAI), 2024

387

24 Mar 2024

InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding

European Conference on Computer Vision (ECCV), 2024

...

Yifei Huang

Yu Qiao

Yali Wang

Limin Wang

273

104

22 Mar 2024

FMM-Attack: A Flow-based Multi-modal Adversarial Attack on Video-based LLMs

Kuofeng Gao

245

20 Mar 2024

RelationVLM: Making Large Vision-Language Models Understand Visual Relations

155

19 Mar 2024

Contextual AD Narration with Interleaved Multimodal Sequence

Computer Vision and Pattern Recognition (CVPR), 2024

483

19 Mar 2024

HawkEye: Training Video-Text LLMs for Grounding Text in Videos

Yueqian Wang

Xiaojun Meng

Jianxin Liang

Yuxuan Wang

Qun Liu

Dongyan Zhao

230

15 Mar 2024

GPT as Psychologist? Preliminary Evaluations for GPT-4V on Visual Affective Computing

Jiyao Wang

...

Dengbo He

277

09 Mar 2024

CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios

European Conference on Computer Vision (ECCV), 2024

413

07 Mar 2024

Embodied Understanding of Driving Scenarios

European Conference on Computer Vision (ECCV), 2024

Yu Qiao

255

07 Mar 2024

GPTSee: Enhancing Moment Retrieval and Highlight Detection via Description-Based Similarity Features

Sidan Du

331

03 Mar 2024

Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers

...

Hsin-Ying Lee

Ming-Hsuan Yang

371

343

29 Feb 2024

The All-Seeing Project V2: Towards General Relation Comprehension of the Open World

...

Yu Qiao

319

29 Feb 2024

Navigating Hallucinations for Reasoning of Unintentional Activities

318

29 Feb 2024

Percept, Chat, and then Adapt: Multimodal Knowledge Transfer of Foundation Models for Open-World Video Recognition

Yu Qiao

249

29 Feb 2024

OSCaR: Object State Captioning and State Change Representation

552

27 Feb 2024

PhyGrasp: Generalizing Robotic Grasping with Physics-informed Large Multimodal Models

Masayoshi Tomizuka

Mingyu Ding

Wei Zhan

249

26 Feb 2024

RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis

...

Yu Qiao

Mingyu Ding

Ping Luo

253

25 Feb 2024

Slot-VLM: SlowFast Slots for Video-Language Modeling

151

20 Feb 2024

Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning

Hao Fei

385

100

18 Feb 2024

World Model on Million-Length Video And Language With Blockwise RingAttention

Pieter Abbeel

739

144

13 Feb 2024

Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models

Dorsa Sadigh

275

240

12 Feb 2024

Memory Consolidation Enables Long-Context Video Understanding

464

08 Feb 2024

SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

...

Yu Qiao

543

139

08 Feb 2024

Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization

International Conference on Machine Learning (ICML), 2024

Kun Xu

...

267

05 Feb 2024