Learning Manipulation Skills through Robot Chain-of-Thought with Sparse Failure Guidance

Abstract

The acquisition of manipulation skills through language instruction remains an unresolved challenge. Recently, vision-language models have made significant progress in teaching robots these skills. However, their performance is restricted to a narrow range of simple tasks. In this paper, we propose that vision-language models can serve as a superior source of rewards for agents. Our method decomposes complex tasks into simpler sub-goals, enabling better task comprehension, and uses sparse failure guidance to steer the agent away from potential failures. Empirical evidence demonstrates that our algorithm consistently outperforms baselines such as CLIP, LIV, and RoboCLIP. Specifically, it achieves a 5.4× higher average success rate than the best baseline, RoboCLIP, across a series of manipulation tasks, demonstrating a comprehensive understanding of a wide range of robotic manipulation tasks.
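The abstract does not specify how the reward is computed; as a rough illustration of the general idea, the sketch below shows how a CLIP-style vision-language model could score an observation's progress toward the current sub-goal while applying a sparse penalty whenever the observation matches a failure description. All concrete names here (the sub-goal prompts, `failure_prompts`, the penalty, and the threshold) are hypothetical, not taken from the paper.

```python
# Illustrative sketch only; not the paper's published implementation.
# Assumes openai/CLIP (pip install git+https://github.com/openai/CLIP.git).
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def similarity(image, prompts):
    """Cosine similarity between one observation (PIL image) and text prompts."""
    with torch.no_grad():
        img = model.encode_image(preprocess(image).unsqueeze(0).to(device))
        txt = model.encode_text(clip.tokenize(prompts).to(device))
        img = img / img.norm(dim=-1, keepdim=True)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        return (img @ txt.T).squeeze(0)  # shape: (len(prompts),)

def reward(image, subgoals, goal_idx, failure_prompts,
           failure_penalty=-1.0, failure_threshold=0.3):
    """Dense reward from alignment with the current sub-goal prompt,
    plus a sparse penalty if the observation resembles a failure state.
    `failure_penalty` and `failure_threshold` are assumed hyperparameters."""
    progress = similarity(image, subgoals)[goal_idx].item()
    fail_score = similarity(image, failure_prompts).max().item()
    penalty = failure_penalty if fail_score > failure_threshold else 0.0
    return progress + penalty
```

In this reading, decomposing "put the block in the drawer" into sub-goal prompts such as "the gripper reaches the block" and "the block is inside the drawer" gives the agent denser, easier-to-ground feedback than a single task-level prompt, while the failure branch stays sparse: it fires only when a failure description clearly matches, rather than shaping every step.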
