ST-Think: How Multimodal Large Language Models Reason About 4D Worlds from Ego-Centric Videos

16 March 2025

Papers citing "ST-Think: How Multimodal Large Language Models Reason About 4D Worlds from Ego-Centric Videos"

SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition

339

26 Nov 2025

Video2Layout: Recall and Reconstruct Metric-Grounded Cognitive Map for Spatial Reasoning

201

20 Nov 2025

Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks

...

721

29 Oct 2025

Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models

...

742

06 Oct 2025

AgentCaster: Reasoning-Guided Tornado Forecasting

Michael Chen

LLMAG LRM AI4CE

150

02 Oct 2025

How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven Perspective

...

319

23 Sep 2025

UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks

210

15 Jul 2025

DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO

357

09 Jun 2025

EgoVLM: Policy Optimization for Egocentric Video Understanding

215

03 Jun 2025

SiLVR: A Simple Language-based Video Reasoning Framework

185

30 May 2025

Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

283

29 May 2025

VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization

395

25 May 2025

MIRAGE: A Multi-modal Benchmark for Spatial Perception, Reasoning, and Intelligence

406

15 May 2025

EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning

347

07 May 2025

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

...

OffRL AI4TS LRM ReLM VLM

1.2K

5,342

22 Jan 2025

X-LeBench: A Benchmark for Extremely Long Egocentric Video Understanding

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025

...

207

12 Jan 2025

Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs

Computer Vision and Pattern Recognition (CVPR), 2025

...

198

08 Jan 2025

Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

Computer Vision and Pattern Recognition (CVPR), 2024

519

341

18 Dec 2024

EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios

416

05 Dec 2024

HourVideo: 1-Hour Video-Language Understanding

Neural Information Processing Systems (NeurIPS), 2024

Keshigeyan Chandrasegaran

290

07 Nov 2024

FMBench: Benchmarking Fairness in Multimodal Large Language Models on Medical Tasks

Jun Li

245

01 Oct 2024

AMEGO: Active Memory from long EGOcentric videos

European Conference on Computer Vision (ECCV), 2024

Tushar Nagarajan

241

17 Sep 2024

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li

Yuanhan Zhang

Dong Guo

Renrui Zhang

Feng Li

Hao Zhang

Kaichen Zhang

Yanwei Li

Ziwei Liu

Chunyuan Li

MLLM SyDa VLM

567

1,747

06 Aug 2024

LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding

250

355

22 Jul 2024

Nymeria: A Massive Collection of Multimodal Egocentric Daily Motion in the Wild

European Conference on Computer Vision (ECCV), 2024

Fangzhou Hong

...

C. Karen Liu

Ziwei Liu

Jakob Engel

R. D. Nardi

Richard Newcombe

250

14 Jun 2024

SpatialRGPT: Grounded Spatial Reasoning in Vision Language Model

Xiaolong Wang

278

185

03 Jun 2024

Aria Everyday Activities Dataset

...

Jakob Julian Engel

173

20 Feb 2024

SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities

Computer Vision and Pattern Recognition (CVPR), 2024

Dorsa Sadigh

323

538

22 Jan 2024

EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding

Neural Information Processing Systems (NeurIPS), 2023

K. Mangalam

Raiymbek Akshulakov

Jitendra Malik

402

495

17 Aug 2023

Forward-Backward Reasoning in Large Language Models for Mathematical Verification

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

483

15 Aug 2023

Aria Digital Twin: A New Benchmark Dataset for Egocentric 3D Machine Perception

IEEE International Conference on Computer Vision (ICCV), 2023

380

105

10 Jun 2023

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

Salman Khan

450

953

08 Jun 2023

HOI4D: A 4D Egocentric Dataset for Category-Level Human-Object Interaction

Computer Vision and Pattern Recognition (CVPR), 2022

475

263

03 Mar 2022

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Neural Information Processing Systems (NeurIPS), 2022

2.3K

14,608

28 Jan 2022

Ego4D: Around the World in 3,000 Hours of Egocentric Video

...

Antonio Torralba

Mingfei Yan

1.0K

1,464

13 Oct 2021

ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering

AAAI Conference on Artificial Intelligence (AAAI), 2019

Zhou Zhao

301

611

06 Jun 2019

Scaling Egocentric Vision: The EPIC-KITCHENS Dataset

Dima Damen

Sanja Fidler

...

373

1,207

08 Apr 2018

A Weighted Sparse Sampling and Smoothing Frame Transition Approach for Semantic Fast-Forward First-Person Videos

Erickson R. Nascimento

399

23 Feb 2018

Compact CNN for Indexing Egocentric Videos

208

107

28 Apr 2015