VTLA: Vision-Tactile-Language-Action Model with Preference Learning for Insertion Manipulation

14 May 2025
Chaofan Zhang
Peng Hao
Xiaoge Cao
Xiaoshuai Hao
Shaowei Cui
Shuo Wang
Abstract

While vision-language models have advanced significantly, their application in language-conditioned robotic manipulation is still underexplored, especially for contact-rich tasks that extend beyond visually dominant pick-and-place scenarios. To bridge this gap, we introduce the Vision-Tactile-Language-Action (VTLA) model, a novel framework that enables robust policy generation in contact-intensive scenarios by effectively integrating visual and tactile inputs through cross-modal language grounding. We construct a low-cost, multi-modal dataset in a simulation environment, containing vision-tactile-action-instruction pairs specifically designed for the fingertip insertion task. Furthermore, we apply Direct Preference Optimization (DPO) to provide regression-like supervision for the VTLA model, effectively bridging the gap between the classification-based next-token prediction loss and continuous robotic tasks. Experimental results show that the VTLA model outperforms traditional imitation learning methods (e.g., diffusion policies) and existing multi-modal baselines (TLA/VLA), achieving over 90% success rates on unseen peg shapes. Finally, we conduct real-world peg-in-hole experiments to demonstrate the exceptional Sim2Real performance of the proposed VTLA model. For supplementary videos and results, please visit our project website: this https URL
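The DPO component mentioned in the abstract can be illustrated with the standard preference-learning objective. The following is a minimal PyTorch sketch of that objective, here framed as supervision over preferred versus dispreferred action-token sequences; the function and tensor names are illustrative assumptions, not the authors' implementation.

# Sketch of the standard DPO objective (Rafailov et al., 2023), applied here
# hypothetically to discretized action sequences. Names are illustrative only.
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss from summed log-probabilities of the preferred (chosen)
    and dispreferred (rejected) action sequences under the trained policy
    and a frozen reference model."""
    # Log-ratio of policy to reference for each sequence
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    # Preference margin scaled by beta, passed through a logistic loss
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()

# Toy usage with random per-sequence log-probabilities (batch of 4)
policy_c, policy_r = torch.randn(4), torch.randn(4)
ref_c, ref_r = torch.randn(4), torch.randn(4)
print(dpo_loss(policy_c, policy_r, ref_c, ref_r))

In this formulation the model receives a graded, regression-like signal from ranked action candidates rather than a single hard next-token target, which is the gap the abstract says DPO is used to bridge.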

View on arXiv
@article{zhang2025_2505.09577,
  title={VTLA: Vision-Tactile-Language-Action Model with Preference Learning for Insertion Manipulation},
  author={Chaofan Zhang and Peng Hao and Xiaoge Cao and Xiaoshuai Hao and Shaowei Cui and Shuo Wang},
  journal={arXiv preprint arXiv:2505.09577},
  year={2025}
}