Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition

International Conference on Machine Learning (ICML), 2024

8 January 2025

Papers citing "Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition"

From Perception to Reasoning: Deep Thinking Empowers Multimodal Large Language Models

450

17 Nov 2025

UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist

202

11 Nov 2025

Enhancing Multimodal Reasoning via Latent Refocusing

178

04 Nov 2025

StreamingCoT: A Dataset for Temporal Dynamics and Multimodal Chain-of-Thought Reasoning in Streaming VideoQA

150

29 Oct 2025

Perception, Understanding and Reasoning, A Multimodal Benchmark for Video Fake News Detection

28 Oct 2025

MUStReason: A Benchmark for Diagnosing Pragmatic Reasoning in Video-LMs for Multimodal Sarcasm Detection

27 Oct 2025

Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning

132

27 Oct 2025

SceneCOT: Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes

166

19 Oct 2025

Select Less, Reason More: Prioritizing Evidence Purity for Video Reasoning

17 Oct 2025

When Thinking Drifts: Evidential Grounding for Robust Video Reasoning

261

07 Oct 2025

Beyond Isolated Facts: Synthesizing Narrative and Grounded Supervision for VideoQA

100

29 Sep 2025

MOSS-ChatV: Reinforcement Learning with Process Reasoning Reward for Video Temporal Reasoning

393

25 Sep 2025

Citrus-V: Advancing Medical Foundation Models with Unified Medical Image Grounding for Clinical Reasoning

195

23 Sep 2025

LEAF-Mamba: Local Emphatic and Adaptive Fusion State Space Model for RGB-D Salient Object Detection

179

23 Sep 2025

3D Aware Region Prompted Vision Language Model

139

16 Sep 2025

FineQuest: Adaptive Knowledge-Assisted Sports Video Understanding via Agent-of-Thoughts Reasoning

174

15 Sep 2025

Dr.V: A Hierarchical Perception-Temporal-Cognition Framework to Diagnose Video Hallucination by Fine-grained Spatial-Temporal Grounding

169

15 Sep 2025

AdsQA: Towards Advertisement Video Understanding

144

10 Sep 2025

A Comprehensive Survey on Trustworthiness in Reasoning with Large Language Models

205

04 Sep 2025

Why Do MLLMs Struggle with Spatial Understanding? A Systematic Analysis from Data to Architecture

113

02 Sep 2025

ProPy: Building Interactive Prompt Pyramids upon CLIP for Partially Relevant Video Retrieval

Yi Pan

Yujia Zhang

Michael C. Kampffmeyer

Xiaoguang Zhao

116

26 Aug 2025

See What You Need: Query-Aware Visual Intelligence through Reasoning-Perception Loops

25 Aug 2025

EGOILLUSION: Benchmarking Hallucinations in Egocentric Video Understanding

Ashish Seth

Utkarsh Tyagi

Ramaneswaran Selvakumar

221

18 Aug 2025

Empowering Multimodal LLMs with External Tools: A Comprehensive Survey

181

14 Aug 2025

Episodic Memory Representation for Long-form Video Understanding

130

13 Aug 2025

Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning

200

06 Aug 2025

StreamAgent: Towards Anticipatory Agents for Streaming Video Understanding

343

03 Aug 2025

CausalStep: A Benchmark for Explicit Stepwise Causal Reasoning in Videos

265

22 Jul 2025

LeAdQA: LLM-Driven Context-Aware Temporal Grounding for Video Question Answering

193

20 Jul 2025

Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning

258

09 Jul 2025

Cautious Next Token Prediction

214

03 Jul 2025

Chiron-o1: Igniting Multimodal Large Language Models towards Generalizable Medical Reasoning via Mentor-Intern Collaborative Search

307

20 Jun 2025

DAVID-XR1: Detecting AI-Generated Videos with Explainable Reasoning

335

13 Jun 2025

VFaith: Do Large Multimodal Models Really Reason on Seen Images Rather than Previous Memories?

213

13 Jun 2025

VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos

457

12 Jun 2025

What Limits Virtual Agent Application? OmniBench: A Scalable Multi-Dimensional Benchmark for Essential Virtual Agent Capabilities

215

10 Jun 2025

Uncertainty-o: One Model-agnostic Framework for Unveiling Uncertainty in Large Multimodal Models

270

09 Jun 2025

Video-Skill-CoT: Skill-based Chain-of-Thoughts for Domain-Adaptive Video Reasoning

318

04 Jun 2025

Argus Inspection: Do Multimodal Large Language Models Possess the Eye of Panoptes?

295

03 Jun 2025

Chain-of-Frames: Advancing Video Understanding in Multimodal LLMs via Frame-Aware Reasoning

Sara Ghazanfari

Francesco Croce

Nicolas Flammarion

Prashanth Krishnamurthy

Farshad Khorrami

S. Garg

182

31 May 2025

SiLVR: A Simple Language-based Video Reasoning Framework

182

30 May 2025

ViQAgent: Zero-Shot Video Question Answering via Agent with Open-Vocabulary Grounding Validation

Tony Montes

Fernando Lozano

336

21 May 2025

ViC-Bench: Benchmarking Visual-Interleaved Chain-of-Thought Capability in MLLMs with Free-Style Intermediate State Representations

...

Hairong Dong

Dingkang Yang

353

20 May 2025

VISTA: Mitigating Semantic Inertia in Video-LLMs via Training-Free Dynamic Chain-of-Thought Routing

376

17 May 2025

RAVU: Retrieval Augmented Video Understanding with Compositional Reasoning over Graph

1.0K

06 May 2025

MINERVA: Evaluating Complex Video Reasoning

336

01 May 2025

Reinforced MLLM: A Survey on RL-Based Reasoning in Multimodal Large Language Models

541

30 Apr 2025

Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning

321

17 Apr 2025

VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models

970

17 Apr 2025

REVEAL: Relation-based Video Representation Learning for Video-Question-Answering

880

07 Apr 2025