MiniVLN: Efficient Vision-and-Language Navigation by Progressive Knowledge Distillation

IEEE International Conference on Robotics and Automation (ICRA), 2024

27 September 2024

Junyou Zhu, Yanyuan Qiao, Siqi Zhang, Xingjian He, Qi Wu, Jing Liu


Main: 6 pages, 5 figures. Bibliography: 1 page.

Abstract

In recent years, Embodied Artificial Intelligence (Embodied AI) has advanced rapidly, but growing model sizes conflict with the limited computational capabilities of Embodied AI platforms. To address this challenge, we aim to achieve both high model performance and practical deployability. Specifically, we focus on Vision-and-Language Navigation (VLN), a core task in Embodied AI. This paper introduces a two-stage knowledge distillation framework that produces a student model, MiniVLN, and demonstrates the potential of distillation techniques for developing lightweight models. The proposed method captures fine-grained knowledge during the pretraining phase and navigation-specific knowledge during the fine-tuning phase. Our findings indicate that two-stage distillation narrows the performance gap between the teacher model and the student model more effectively than single-stage distillation. On the public R2R and REVERIE benchmarks, MiniVLN achieves performance on par with the teacher model while having only about 12% of the teacher model's parameter count.
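
The abstract does not state which distillation objective is used at each stage. As a rough illustration only, the sketch below shows a generic temperature-softened distillation loss (the widely used KL-divergence form), combined with a task loss through a mixing weight. The function names, the `alpha` weight, the temperature value, and the toy logits are all assumptions made for this example and are not taken from the paper; in a two-stage setup, a loss of this general shape could be applied once during pretraining and again during fine-tuning, each time against the teacher outputs of that stage.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The T^2 factor is the usual rescaling that keeps gradient magnitudes
    comparable across temperatures. This is a generic objective, not
    necessarily the one used for MiniVLN.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


def stage_loss(task_loss, teacher_logits, student_logits,
               alpha=0.5, temperature=2.0):
    """Mix a stage's own task loss with the distillation term.

    alpha is a hypothetical mixing weight chosen for illustration.
    """
    distill = kd_loss(teacher_logits, student_logits, temperature)
    return (1.0 - alpha) * task_loss + alpha * distill


# Toy logits over four candidate actions (made-up numbers).
teacher = [2.0, 0.5, -1.0, 0.1]
student = [1.2, 0.9, -0.3, 0.0]
print(stage_loss(task_loss=0.8, teacher_logits=teacher, student_logits=student))
```

When the student reproduces the teacher's logits exactly, the distillation term is zero and only the weighted task loss remains, which is a quick sanity check for this kind of objective.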
