
EagleVision: A Dual-Stage Framework with BEV-grounding-based Chain-of-Thought for Spatial Intelligence

Jiaxu Wan
Xu Wang
Mengwei Xie
Hang Zhang
Mu Xu
Yang Han
Hong Zhang
Ding Yuan
Yifan Yang
Main: 8 pages · Bibliography: 2 pages · Appendix: 3 pages · 7 figures · 6 tables
Abstract

Video-based spatial reasoning -- such as estimating distances, judging directions, or understanding layouts from multiple views -- requires selecting informative frames and, when needed, actively seeking additional viewpoints during inference. Existing multimodal large language models (MLLMs) consume a fixed set of uniformly sampled frames and cannot request new views once reasoning begins, often missing the geometric cues necessary for reliable spatial judgments. We present EagleVision, a dual-stage framework that combines geometry-aware frame selection with active, Bird's-Eye-View (BEV)-grounded reasoning. In the first stage (macro perception), a semantics-perspective-fusion determinantal point process (SPF-DPP) selects a compact set of keyframes that jointly maximize semantic relevance and viewpoint diversity under a fixed token budget. In the second stage (micro verification), the model performs an iterative spatial Chain-of-Thought: at each step it can either reason in text or predict a pose on the BEV plane to retrieve the nearest real frame, forming a closed-loop hypothesize-look-verify cycle. The querying policy is trained purely via reinforcement learning with a spatial grounding reward, requiring no human-annotated reasoning traces. On VSI-Bench and SQA3D, EagleVision achieves state-of-the-art performance among open-source vision-language models.
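The abstract does not spell out how the SPF-DPP kernel is constructed, but the relevance-plus-diversity selection it describes maps onto a standard determinantal point process with a quality-times-similarity kernel and greedy MAP inference. The sketch below assumes that form; the function name, the `alpha` weight, and the use of pose embeddings as viewpoint descriptors are illustrative assumptions, not the paper's actual design.

```python
import numpy as np

def select_keyframes(sem_scores, view_feats, k, alpha=1.0):
    """Greedy MAP inference for a DPP whose kernel fuses semantic
    relevance (quality) with viewpoint diversity (similarity).

    sem_scores : (n,) semantic relevance score per candidate frame
    view_feats : (n, d) viewpoint descriptors (e.g. pose embeddings),
                 assumed L2-normalised
    k          : number of keyframes allowed by the token budget
    alpha      : weight trading relevance against diversity (assumption)
    """
    n = len(sem_scores)
    q = np.exp(alpha * np.asarray(sem_scores))   # quality term
    S = view_feats @ view_feats.T                # viewpoint similarity
    L = q[:, None] * S * q[None, :]              # L = diag(q) S diag(q)
    L = L + 1e-6 * np.eye(n)                     # numerical stability

    selected = []
    for _ in range(min(k, n)):
        best_j, best_logdet = None, -np.inf
        for j in range(n):
            if j in selected:
                continue
            idx = selected + [j]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_logdet:
                best_j, best_logdet = j, logdet
        if best_j is None:   # no candidate still increases the determinant
            break
        selected.append(best_j)
    return sorted(selected)
```

In this reading, the returned indices determine which frames are tokenised and passed to the MLLM under the fixed token budget.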
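Likewise, the micro-verification step's "predict a pose, retrieve the nearest real frame" operation can be read as a nearest-neighbour lookup over the recorded camera poses on the BEV plane. The snippet below is a minimal sketch under that reading; the (x, y, yaw) parameterisation and the `w_rot` weighting of heading error are assumptions rather than the paper's retrieval rule.

```python
import numpy as np

def retrieve_nearest_frame(query_pose, frame_poses, w_rot=0.5):
    """Return the index of the captured frame whose BEV pose is closest
    to a queried pose (x, y, yaw). The distance mixes planar offset with
    a wrapped heading difference; the weighting is an assumption.
    """
    query_pose = np.asarray(query_pose, dtype=float)
    frame_poses = np.asarray(frame_poses, dtype=float)
    pos_d = np.linalg.norm(frame_poses[:, :2] - query_pose[:2], axis=1)
    yaw_d = np.abs(np.angle(np.exp(1j * (frame_poses[:, 2] - query_pose[2]))))
    return int(np.argmin(pos_d + w_rot * yaw_d))
```

The retrieved frame would then be fed back into the reasoning loop, closing the hypothesize-look-verify cycle described in the abstract.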
