Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms

Abstract

Building a generalist model for user interface (UI) understanding is challenging due to several foundational issues, such as platform diversity, resolution variation, and data limitations. In this paper, we introduce Ferret-UI 2, a multimodal large language model (MLLM) designed for universal UI understanding across a wide range of platforms, including iPhone, Android, iPad, webpage, and AppleTV. Building on the foundation of Ferret-UI, Ferret-UI 2 introduces three key innovations: support for multiple platform types, high-resolution perception through adaptive scaling, and advanced task training data generation powered by GPT-4o with set-of-mark visual prompting. These advancements enable Ferret-UI 2 to perform complex, user-centered interactions, making it highly versatile and adaptable to the expanding diversity of platform ecosystems. Extensive empirical experiments on referring, grounding, user-centric advanced tasks (comprising 9 subtasks × 5 platforms), the GUIDE next-action prediction dataset, and the GUI-World multi-platform benchmark demonstrate that Ferret-UI 2 significantly outperforms Ferret-UI and also shows strong cross-platform transfer capabilities.
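
The abstract names adaptive scaling and set-of-mark visual prompting only at a high level. As a rough, hypothetical illustration (not the paper's actual pipeline), the Python sketch below downsizes a screenshot to a fixed pixel budget while preserving aspect ratio, then overlays numbered marks on UI element boxes so a model such as GPT-4o can refer to elements by index when generating task data. The pixel budget, box coordinates, and function names are assumptions made for illustration.

from PIL import Image, ImageDraw  # pip install Pillow

def adaptive_scale(img, max_pixels=1280 * 1280):
    """Downscale only when the screenshot exceeds a pixel budget,
    keeping the aspect ratio so phone, tablet, and TV layouts stay undistorted.
    Returns the (possibly resized) image and the scale factor applied."""
    w, h = img.size
    if w * h <= max_pixels:
        return img, 1.0
    s = (max_pixels / (w * h)) ** 0.5
    return img.resize((max(1, int(w * s)), max(1, int(h * s))), Image.LANCZOS), s

def draw_set_of_marks(img, boxes):
    """Overlay a numbered red mark on each UI element box so a vision-language
    model can reference elements by index ("mark 2") instead of coordinates."""
    out = img.copy()
    draw = ImageDraw.Draw(out)
    for i, (l, t, r, b) in enumerate(boxes):
        draw.rectangle((l, t, r, b), outline="red", width=3)
        label = str(i)
        draw.rectangle((l, t, l + 10 + 8 * len(label), t + 16), fill="red")
        draw.text((l + 3, t + 2), label, fill="white")
    return out

if __name__ == "__main__":
    screenshot = Image.new("RGB", (2048, 2732), "white")      # stand-in for an iPad capture
    raw_boxes = [(100, 200, 400, 260), (100, 300, 400, 360)]  # hypothetical detections
    img, s = adaptive_scale(screenshot)
    boxes = [tuple(int(v * s) for v in box) for box in raw_boxes]  # rescale boxes too
    draw_set_of_marks(img, boxes).save("marked.png")  # attach to a GPT-4o prompt

Scaling the element boxes by the same factor as the image keeps the marks aligned after downsizing; the marked image can then be attached to a data-generation prompt that asks the model to describe or act on elements by their mark index.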

@article{li2025_2410.18967,
  title={Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms},
  author={Zhangheng Li and Keen You and Haotian Zhang and Di Feng and Harsh Agrawal and Xiujun Li and Mohana Prasad Sathya Moorthy and Jeff Nichols and Yinfei Yang and Zhe Gan},
  journal={arXiv preprint arXiv:2410.18967},
  year={2025}
}