HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers

12 September 2024

Jianke Zhang

Xiaoyu Chen

Jianyu Chen

Abstract

Large Vision-Language-Action (VLA) models, which build on powerful pre-trained Vision-Language Model (VLM) backends, have shown promise in robotic control thanks to their strong generalization ability. This success comes at a cost. Because they rely on VLM backends with billions of parameters, these models incur high computational cost and inference latency, which restricts evaluation mainly to quasi-static tasks and hinders performance on dynamic tasks that require rapid interaction. To address these limitations, this paper proposes HiRT, a Hierarchical Robot Transformer framework that enables a flexible trade-off between control frequency and performance. HiRT keeps the VLM running at a low frequency to capture temporally invariant features, while a high-frequency vision-based policy, guided by these slowly updated features, handles real-time interaction. Experimental results in both simulation and real-world settings demonstrate significant improvements over baseline methods. Empirically, in static tasks, we double the control frequency and achieve comparable success rates. Additionally, on novel real-world dynamic manipulation tasks, which are challenging for previous VLA models, HiRT improves the success rate from 48% to 75%.
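The two-rate idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `DualRateController`, the parameter `slow_period`, and both toy functions are hypothetical stand-ins, and the sketch refreshes the slow features synchronously for simplicity, since the abstract does not specify how the two modules are scheduled.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence

Vector = List[float]


@dataclass
class DualRateController:
    """Hypothetical sketch: a slow encoder refreshed every `slow_period`
    steps, and a fast policy that runs on every step using the cached output."""

    slow_encoder: Callable[[Sequence[float], str], Vector]    # stand-in for the VLM
    fast_policy: Callable[[Sequence[float], Vector], Vector]  # stand-in for the fast policy
    slow_period: int = 4          # assumed ratio between fast and slow rates
    _latent: Vector = field(default_factory=list)
    _step: int = 0
    slow_calls: int = 0           # counts how often the slow encoder actually ran

    def act(self, observation: Sequence[float], instruction: str) -> Vector:
        # Refresh the cached features only on every `slow_period`-th step.
        if self._step % self.slow_period == 0:
            self._latent = self.slow_encoder(observation, instruction)
            self.slow_calls += 1
        self._step += 1
        # The fast policy sees the fresh observation plus the (possibly stale) features.
        return self.fast_policy(observation, self._latent)


def toy_slow_encoder(observation: Sequence[float], instruction: str) -> Vector:
    # Toy placeholder: summarizes the observation and the instruction length.
    return [sum(observation) / len(observation), float(len(instruction))]


def toy_fast_policy(observation: Sequence[float], latent: Vector) -> Vector:
    # Toy placeholder: offsets the observation by the first cached feature.
    return [x - latent[0] for x in observation]


controller = DualRateController(toy_slow_encoder, toy_fast_policy, slow_period=4)
actions = [
    controller.act([float(t), float(t) + 1.0], "pick up the cup")
    for t in range(8)
]
```

In this toy run the fast policy produces an action on each of the 8 steps while the slow encoder runs only twice, so the expensive module's cost is amortized over several control steps. Between refreshes the policy acts on stale features, which is the trade-off the `slow_period` setting controls in this sketch.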


@article{zhang2025_2410.05273,
  title={HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers},
  author={Jianke Zhang and Yanjiang Guo and Xiaoyu Chen and Yen-Jen Wang and Yucheng Hu and Chengming Shi and Jianyu Chen},
  journal={arXiv preprint arXiv:2410.05273},
  year={2025}
}
