OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction

5 March 2025
Huang Huang
Fangchen Liu
Letian Fu
Tingfan Wu
Mustafa Mukadam
Jitendra Malik
Ken Goldberg
Pieter Abbeel
    LM&Ro
    VLM
Abstract

Vision-Language-Action (VLA) models aim to predict robotic actions based on visual observations and language instructions. Existing approaches require fine-tuning pre-trained vision-language models (VLMs) because visual and language features are independently fed into downstream policies, degrading the pre-trained semantic alignments. We propose OTTER, a novel VLA architecture that leverages these existing alignments through explicit, text-aware visual feature extraction. Instead of processing all visual features, OTTER selectively extracts and passes only task-relevant visual features that are semantically aligned with the language instruction to the policy transformer. This allows OTTER to keep the pre-trained vision-language encoders frozen, preserving and utilizing the rich semantic understanding learned from large-scale pre-training and enabling strong zero-shot generalization capabilities. In simulation and real-world experiments, OTTER significantly outperforms existing VLA models, demonstrating strong zero-shot generalization to novel objects and environments. Video, code, checkpoints, and dataset: this https URL.
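The abstract's central mechanism, using the language instruction to select which visual features reach the policy transformer, can be illustrated as cross-attention pooling over the outputs of frozen vision and language encoders. The sketch below is an interpretation of the abstract only, not the authors' released code: the module name, feature dimensions, and the use of a single multi-head attention layer are all assumptions.

# Minimal PyTorch sketch of text-aware visual feature extraction.
# Illustrative only; all names and shapes here are assumptions, not OTTER's actual implementation.
import torch
import torch.nn as nn

class TextAwareVisualExtractor(nn.Module):
    """Language tokens query visual patch tokens, so only
    instruction-relevant visual content is passed downstream."""
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text_tokens: torch.Tensor, visual_patches: torch.Tensor) -> torch.Tensor:
        # text_tokens:    (B, T, D) token embeddings from a frozen language encoder
        # visual_patches: (B, P, D) patch embeddings from a frozen vision encoder
        # Queries come from text, keys/values from vision, so the pooled output
        # keeps only the visual features aligned with the instruction.
        pooled, _ = self.attn(query=text_tokens, key=visual_patches, value=visual_patches)
        return pooled  # (B, T, D) task-relevant visual features for the policy transformer

# Usage with random stand-ins for frozen encoder outputs:
if __name__ == "__main__":
    B, T, P, D = 2, 16, 196, 512
    extractor = TextAwareVisualExtractor(dim=D)
    text = torch.randn(B, T, D)       # stand-in for frozen text-encoder features
    patches = torch.randn(B, P, D)    # stand-in for frozen vision-encoder patch features
    print(extractor(text, patches).shape)  # torch.Size([2, 16, 512])

Because the selection is computed from the frozen encoders' own embeddings, no gradient needs to flow into the vision-language backbone; only the extraction and policy modules would be trained, which is the property the abstract credits for preserving pre-trained semantic alignment.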

@article{huang2025_2503.03734,
  title={OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction},
  author={Huang Huang and Fangchen Liu and Letian Fu and Tingfan Wu and Mustafa Mukadam and Jitendra Malik and Ken Goldberg and Pieter Abbeel},
  journal={arXiv preprint arXiv:2503.03734},
  year={2025}
}