
ABot-N0: Technical Report on the VLA Foundation Model for Versatile Embodied Navigation

Zedong Chu
Shichao Xie
Xiaolong Wu
Yanfen Shen
Minghua Luo
Zhengbo Wang
Fei Liu
Xiaoxu Leng
Junjun Hu
Mingyang Yin
Jia Lu
Yingnan Guo
Kai Yang
Jiawei Han
Xu Chen
Yanqing Zhu
Yuxiang Zhao
Xin Liu
Yirong Yang
Ye He
Jiahang Wang
Yang Cai
Tianlin Zhang
Li Gao
Liu Liu
Mingchao Sun
Fan Jiang
Chiyu Wang
Zhicheng Liu
Hongyu Pan
Honglin Han
Zhining Gu
Kuan Yang
Jianfang Zhang
Di Jing
Zihao Guan
Wei Guo
Guoqing Liu
Di Yang
Xiangpo Yang
Menglin Yang
Hongguang Xing
Weiguo Li
Mu Xu
Main:29 Pages
19 Figures
Bibliography:5 Pages
5 Tables
Abstract

Embodied navigation has long been fragmented by task-specific architectures. We introduce ABot-N0, a unified Vision-Language-Action (VLA) foundation model that achieves a "Grand Unification" across 5 core tasks: Point-Goal, Object-Goal, Instruction-Following, POI-Goal, and Person-Following. ABot-N0 utilizes a hierarchical "Brain-Action" architecture, pairing an LLM-based Cognitive Brain for semantic reasoning with a Flow Matching-based Action Expert for precise, continuous trajectory generation. To support large-scale learning, we developed the ABot-N0 Data Engine, curating 16.9M expert trajectories and 5.0M reasoning samples across 7,802 high-fidelity 3D scenes (10.7 km²). ABot-N0 achieves new SOTA performance across 7 benchmarks, significantly outperforming specialized models. Furthermore, our Agentic Navigation System integrates a planner with hierarchical topological memory, enabling robust, long-horizon missions in dynamic real-world environments.
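
To make the idea of flow-matching trajectory generation concrete, the following is a minimal sketch of the sampling side only: waypoints start as Gaussian noise and are moved along a velocity field by Euler integration from t = 0 to t = 1. It is an illustration under stated assumptions, not the ABot-N0 Action Expert. The names `toy_velocity` and `sample_trajectory` are hypothetical, and the learned, observation-conditioned velocity network is replaced by a closed-form field that points straight at a fixed set of goal waypoints.

```python
import random


def toy_velocity(waypoints, t, goal_waypoints):
    """Hypothetical stand-in for a learned velocity field.

    For the straight-line path x_t = (1 - t) * x_0 + t * x_1, the velocity
    that carries x_t to x_1 is (x_1 - x_t) / (1 - t). A real model would
    predict this from observations instead of reading the goal directly.
    """
    return [
        [(g - x) / (1.0 - t) for x, g in zip(point, goal)]
        for point, goal in zip(waypoints, goal_waypoints)
    ]


def sample_trajectory(goal_waypoints, num_steps=10, seed=0):
    """Generate waypoints by Euler-integrating the velocity field from noise."""
    rng = random.Random(seed)
    # Start from Gaussian noise with the same shape as the trajectory.
    x = [[rng.gauss(0.0, 1.0) for _ in point] for point in goal_waypoints]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so (1 - t) is never zero
        v = toy_velocity(x, t, goal_waypoints)
        x = [
            [xi + dt * vi for xi, vi in zip(point, vel)]
            for point, vel in zip(x, v)
        ]
    return x


if __name__ == "__main__":
    # Three 2D waypoints (x, y) in metres; values are made up for the demo.
    goal = [[0.5, 0.0], [1.0, 0.1], [1.5, 0.3]]
    for point in sample_trajectory(goal):
        print([round(c, 3) for c in point])
```

With this particular toy field, Euler integration lands on the goal waypoints regardless of the noise sample; with a learned field the output would instead depend on the conditioning inputs and on the number of integration steps.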
