
Vision-Language Model Predictive Control for Manipulation Planning and Trajectory Generation

7 April 2025
Jiaming Chen
Wentao Zhao
Ziyu Meng
Donghui Mao
Ran Song
Wei Pan
Wei Zhang
Abstract

Model Predictive Control (MPC) is a widely adopted control paradigm that leverages predictive models to estimate future system states and optimize control inputs accordingly. However, while MPC excels in planning and control, it lacks the capability for environmental perception, leading to failures in complex and unstructured scenarios. To address this limitation, we introduce Vision-Language Model Predictive Control (VLMPC), a robotic manipulation planning framework that integrates the perception power of vision-language models (VLMs) with MPC. VLMPC utilizes a conditional action sampling module that takes a goal image or language instruction as input and leverages VLM to generate candidate action sequences. These candidates are fed into a video prediction model that simulates future frames based on the actions. In addition, we propose an enhanced variant, Traj-VLMPC, which replaces video prediction with motion trajectory generation to reduce computational complexity while maintaining accuracy. Traj-VLMPC estimates motion dynamics conditioned on the candidate actions, offering a more efficient alternative for long-horizon tasks and real-time applications. Both VLMPC and Traj-VLMPC select the optimal action sequence using a VLM-based hierarchical cost function that captures both pixel-level and knowledge-level consistency between the current observation and the task input. We demonstrate that both approaches outperform existing state-of-the-art methods on public benchmarks and achieve excellent performance in various real-world robotic manipulation tasks. Code is available at this https URL.
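The planning loop described in the abstract — sample candidate action sequences, predict their outcomes, score them with a cost function, execute the best first action, and replan — can be sketched as a minimal sampling-based MPC loop. All components below are illustrative stubs standing in for the paper's modules: `sample_candidate_actions` stands in for the VLM conditional action sampler, `predict_rollout` for the video/trajectory prediction model, and `hierarchical_cost` for the VLM-based hierarchical cost. The names, signatures, and 2D toy dynamics are assumptions for illustration, not the authors' implementation.

```python
import random

def sample_candidate_actions(goal_text, n_candidates=8, horizon=5):
    """Stub for the VLM conditional action sampler: given a goal (image or
    language instruction), propose candidate action sequences. Here we just
    draw random 2D displacement sequences."""
    return [[(random.uniform(-1, 1), random.uniform(-1, 1))
             for _ in range(horizon)]
            for _ in range(n_candidates)]

def predict_rollout(state, actions):
    """Stub for the prediction model: roll the state forward under one
    candidate action sequence. (VLMPC predicts future video frames here;
    Traj-VLMPC predicts a motion trajectory directly, which is cheaper.)"""
    x, y = state
    traj = []
    for dx, dy in actions:
        x, y = x + dx, y + dy
        traj.append((x, y))
    return traj

def hierarchical_cost(traj, goal_state):
    """Stub for the hierarchical cost: only a geometric term (standing in
    for pixel-level consistency). The real cost adds a knowledge-level
    term queried from the VLM."""
    gx, gy = goal_state
    fx, fy = traj[-1]
    return (fx - gx) ** 2 + (fy - gy) ** 2

def vlmpc_step(state, goal_state, goal_text="reach the target"):
    """One MPC iteration: sample candidates, score their predicted
    rollouts, and return the first action of the lowest-cost sequence
    (receding-horizon execution: act once, then replan)."""
    candidates = sample_candidate_actions(goal_text)
    best = min(candidates,
               key=lambda a: hierarchical_cost(predict_rollout(state, a),
                                               goal_state))
    return best[0]

# Toy closed-loop run: replan at every step.
state, goal = (0.0, 0.0), (3.0, 3.0)
for _ in range(20):
    dx, dy = vlmpc_step(state, goal)
    state = (state[0] + dx, state[1] + dy)
```

The receding-horizon structure (optimize a full sequence, execute only its first action) is standard MPC; the paper's contribution is supplying the sampler and the cost from a VLM rather than from hand-built models.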

@article{chen2025_2504.05225,
  title={Vision-Language Model Predictive Control for Manipulation Planning and Trajectory Generation},
  author={Jiaming Chen and Wentao Zhao and Ziyu Meng and Donghui Mao and Ran Song and Wei Pan and Wei Zhang},
  journal={arXiv preprint arXiv:2504.05225},
  year={2025}
}