2503.12542
ST-Think: How Multimodal Large Language Models Reason About 4D Worlds from Ego-Centric Videos
16 March 2025
Peiran Wu
Yunze Liu
Chonghan Liu
Xinyi Zheng
VGen
LRM
Papers citing
"ST-Think: How Multimodal Large Language Models Reason About 4D Worlds from Ego-Centric Videos"
39 / 39 papers shown
SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition
Peiran Xu
Sudong Wang
Yao Zhu
Jianing Li
Yunjian Zhang
LRM
339
1
0
26 Nov 2025
Video2Layout: Recall and Reconstruct Metric-Grounded Cognitive Map for Spatial Reasoning
Yibin Huang
Wang Xu
Wanyue Zhang
Helu Zhi
JingJing Huang
Yangbin Xu
Yangang Sun
Conghui Zhu
Tiejun Zhao
201
0
0
20 Nov 2025
Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks
Xu Zheng
Zihao Dongfang
Lutao Jiang
Boyuan Zheng
Yulong Guo
...
L. Zhang
Danda Pani Paudel
Nicu Sebe
Luc Van Gool
Xuming Hu
LRM
VLM
721
4
0
29 Oct 2025
Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models
Yunlong Tang
Jing Bi
Pinxin Liu
Zhenyu Pan
Mingqian Feng
...
Zeliang Zhang
Daiki Shimada
Han Liu
Jiebo Luo
Chenliang Xu
MLLM
OffRL
VLM
LRM
742
8
0
06 Oct 2025
AgentCaster: Reasoning-Guided Tornado Forecasting
Michael Chen
LLMAG
LRM
AI4CE
150
0
0
02 Oct 2025
How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven Perspective
S. Yu
Yuxin Chen
Hao Ju
Lianjie Jia
Fuxi Zhang
...
Lin Song
Lijun Wang
Yanwei Li
Y. Shan
Huchuan Lu
LRM
319
9
0
23 Sep 2025
UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks
Peiran Wu
Yunze Liu
Zhengdong Zhu
Enmin Zhou
Junxiao Shen
210
2
0
15 Jul 2025
DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO
Jinyoung Park
Jeehye Na
Jinyoung Kim
H. Kim
OffRL
357
21
0
09 Jun 2025
EgoVLM: Policy Optimization for Egocentric Video Understanding
Ashwin Vinod
Shrey Pandit
Aditya Vavre
Linshen Liu
LRM
215
5
0
03 Jun 2025
SiLVR: A Simple Language-based Video Reasoning Framework
Ce Zhang
Yan-Bo Lin
Ziyang Wang
Mohit Bansal
Gedas Bertasius
LRM
185
7
0
30 May 2025
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
Diankun Wu
Fangfu Liu
Yi-Hsin Hung
Yueqi Duan
LRM
283
63
0
29 May 2025
VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Gudied Iterative Policy Optimization
Yunxin Li
Xinyu Chen
Zitao Li
Zhenyu Liu
L. Wang
Tong Lu
Baotian Hu
Min Zhang
OffRL
LRM
395
8
0
25 May 2025
MIRAGE: A Multi-modal Benchmark for Spatial Perception, Reasoning, and Intelligence
Chonghan Liu
Jian Shu
Felix Henry
Pu Miao
Yajie Zhang
Yu Zhao
Peiran Wu
VLM
406
2
0
15 May 2025
EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning
Zhenghao Xing
Xiaowei Hu
Chi-Wing Fu
Wei Wang
Jifeng Dai
Pheng-Ann Heng
MLLM
OffRL
VLM
LRM
347
12
0
07 May 2025
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
DeepSeek-AI
Daya Guo
Dejian Yang
Haowei Zhang
Junxiao Song
...
Shiyu Wang
S. Yu
Shunfeng Zhou
Shuting Pan
S.S. Li
OffRL
AI4TS
LRM
ReLM
VLM
1.2K
5,342
0
22 Jan 2025
X-LeBench: A Benchmark for Extremely Long Egocentric Video Understanding
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025
Wenqi Zhou
Kai Cao
Hao Zheng
Xinyi Zheng
...
Per Ola Kristensson
Fan Zhang
Weizhe Lin
Junxiao Shen
VLM
207
3
0
12 Jan 2025
Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs
Computer Vision and Pattern Recognition (CVPR), 2025
Zeyi Huang
Xiaofang Wang
Nikhil Mehta
Tong Xiao
...
Bolin Lai
Licheng Yu
Ning Zhang
Yong Jae Lee
Miao Liu
198
7
0
08 Jan 2025
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
Computer Vision and Pattern Recognition (CVPR), 2024
Jihan Yang
Shusheng Yang
Anjali W. Gupta
Rilyn Han
Li Fei-Fei
Saining Xie
LRM
519
341
0
18 Dec 2024
EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios
Lu Qiu
Yuying Ge
Yi Chen
Yixiao Ge
Mingyu Ding
Xihui Liu
LLMAG
LRM
416
19
0
05 Dec 2024
HourVideo: 1-Hour Video-Language Understanding
Neural Information Processing Systems (NeurIPS), 2024
Keshigeyan Chandrasegaran
Agrim Gupta
Lea M. Hadzic
Taran Kota
Jimming He
Cristobal Eyzaguirre
Zane Durante
Pengfei Yu
Jiajun Wu
L. Fei-Fei
VLM
290
83
0
07 Nov 2024
FMBench: Benchmarking Fairness in Multimodal Large Language Models on Medical Tasks
Peiran Wu
Che Liu
Chong Chen
Jun Li
Cosmin I. Bercea
Rossella Arcucci
245
13
0
01 Oct 2024
AMEGO: Active Memory from long EGOcentric videos
European Conference on Computer Vision (ECCV), 2024
Gabriele Goletto
Tushar Nagarajan
Giuseppe Averta
Dima Damen
EgoV
241
19
0
17 Sep 2024
LLaVA-OneVision: Easy Visual Task Transfer
Bo Li
Yuanhan Zhang
Dong Guo
Renrui Zhang
Feng Li
Hao Zhang
Kaichen Zhang
Yanwei Li
Ziwei Liu
Chunyuan Li
MLLM
SyDa
VLM
567
1,747
0
06 Aug 2024
LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding
Haoning Wu
Dongxu Li
Bei Chen
Junnan Li
250
355
0
22 Jul 2024
Nymeria: A Massive Collection of Multimodal Egocentric Daily Motion in the Wild
European Conference on Computer Vision (ECCV), 2024
Lingni Ma
Yuting Ye
Fangzhou Hong
Vladimir Guzov
Yifeng Jiang
...
C. Karen Liu
Ziwei Liu
Jakob Engel
R. D. Nardi
Richard Newcombe
250
65
0
14 Jun 2024
SpatialRGPT: Grounded Spatial Reasoning in Vision Language Model
An-Chieh Cheng
Hongxu Yin
Yang Fu
Qiushan Guo
Ruihan Yang
Jan Kautz
Xiaolong Wang
Sifei Liu
LRM
278
185
0
03 Jun 2024
Aria Everyday Activities Dataset
Zhaoyang Lv
Nickolas Charron
Pierre Moulon
Alexander Gamino
Cheng Peng
...
Yuyang Zou
Richard Newcombe
Jakob Julian Engel
Xiaqing Pan
Carl Ren
173
24
0
20 Feb 2024
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities
Computer Vision and Pattern Recognition (CVPR), 2024
Boyuan Chen
Zhuo Xu
Sean Kirmani
Brian Ichter
Danny Driess
Pete Florence
Dorsa Sadigh
Leonidas Guibas
Fei Xia
LRM
ReLM
323
538
0
22 Jan 2024
EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding
Neural Information Processing Systems (NeurIPS), 2023
K. Mangalam
Raiymbek Akshulakov
Jitendra Malik
402
495
0
17 Aug 2023
Forward-Backward Reasoning in Large Language Models for Mathematical Verification
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Weisen Jiang
Han Shi
L. Yu
Zheng Liu
Yu Zhang
Zhenguo Li
James T. Kwok
LRM
483
45
0
15 Aug 2023
Aria Digital Twin: A New Benchmark Dataset for Egocentric 3D Machine Perception
IEEE International Conference on Computer Vision (ICCV), 2023
Xiaqing Pan
Nicholas Charron
Yongqiang Yang
Scott Peters
Thomas Whelan
Chen Kong
Omkar M. Parkhi
Richard Newcombe
C. Ren
VGen
380
105
0
10 Jun 2023
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Muhammad Maaz
H. Rasheed
Salman Khan
Fahad Shahbaz Khan
MLLM
450
953
0
08 Jun 2023
HOI4D: A 4D Egocentric Dataset for Category-Level Human-Object Interaction
Computer Vision and Pattern Recognition (CVPR), 2022
Yunze Liu
Yun-Hai Liu
Chen Jiang
Kangbo Lyu
Weikang Wan
Hao Shen
Bo-Hua Liang
Zhoujie Fu
He Wang
Li Yi
475
263
0
03 Mar 2022
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Neural Information Processing Systems (NeurIPS), 2022
Jason W. Wei
Xuezhi Wang
Dale Schuurmans
Maarten Bosma
Brian Ichter
F. Xia
Ed H. Chi
Quoc Le
Denny Zhou
LM&Ro
LRM
AI4CE
ReLM
2.3K
14,608
0
28 Jan 2022
Ego4D: Around the World in 3,000 Hours of Egocentric Video
Kristen Grauman
Andrew Westbury
Eugene Byrne
Zachary Chavis
Antonino Furnari
...
Mike Zheng Shou
Antonio Torralba
Lorenzo Torresani
Mingfei Yan
Jitendra Malik
EgoV
1.0K
1,464
0
13 Oct 2021
ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering
AAAI Conference on Artificial Intelligence (AAAI), 2019
Zhou Yu
D. Xu
Jun-chen Yu
Ting Yu
Zhou Zhao
Yueting Zhuang
Dacheng Tao
301
611
0
06 Jun 2019
Scaling Egocentric Vision: The EPIC-KITCHENS Dataset
Dima Damen
Hazel Doughty
G. Farinella
Sanja Fidler
Antonino Furnari
...
Davide Moltisanti
Jonathan Munro
Toby Perrett
Will Price
Michael Wray
EgoV
373
1,207
0
08 Apr 2018
A Weighted Sparse Sampling and Smoothing Frame Transition Approach for Semantic Fast-Forward First-Person Videos
M. Silva
W. Ramos
Joao Klock Ferreira
Felipe C. Chamone
M. Campos
Erickson R. Nascimento
399
33
0
23 Feb 2018
Compact CNN for Indexing Egocentric Videos
Y. Poleg
Ariel Ephrat
Shmuel Peleg
Chetan Arora
EgoV
208
107
0
28 Apr 2015