
Visual Text Processing: A Comprehensive Review and Unified Evaluation

Abstract

Visual text is a crucial component of both document and scene images, conveying rich semantic information and attracting significant attention in the computer vision community. Beyond traditional tasks such as text detection and recognition, visual text processing now encompasses tasks such as text image reconstruction and text image manipulation, with rapid advancements driven by the emergence of foundation models. Despite this progress, challenges remain due to the unique properties that differentiate text from general objects. Effectively capturing and leveraging these distinct textual characteristics is essential for developing robust visual text processing models. In this survey, we present a comprehensive, multi-perspective analysis of recent advancements in visual text processing, focusing on two key questions: (1) what textual features are best suited to different visual text processing tasks, and (2) how can these distinctive text features be effectively incorporated into processing frameworks? Furthermore, we introduce VTPBench, a new benchmark that covers a broad range of visual text processing datasets. Leveraging the advanced visual quality assessment capabilities of multimodal large language models (MLLMs), we propose VTPScore, a novel evaluation metric designed to ensure fair and reliable evaluation. Our empirical study of more than 20 models reveals substantial room for improvement in current techniques. We aim for this work to serve as a fundamental resource that fosters future exploration and innovation in the dynamic field of visual text processing. The relevant repository is available at this https URL.
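
The abstract does not specify how VTPScore queries an MLLM. As a rough illustration of the general idea of MLLM-based visual quality assessment, the minimal sketch below asks a vision-capable model to rate a processed text image on a fixed scale. The model name, prompt wording, scoring scale, and the function `mllm_quality_score` are illustrative assumptions, not the paper's protocol.

```python
# Minimal sketch of MLLM-based quality scoring in the spirit of VTPScore.
# This is NOT the authors' implementation; model choice and prompt are assumptions.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def mllm_quality_score(image_path: str, task: str = "text removal") -> float:
    """Ask an MLLM to rate a processed text image from 0 (poor) to 100 (excellent)."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")

    prompt = (
        f"You are evaluating the output of a {task} model. "
        "Rate the overall visual quality of this image on a 0-100 scale, "
        "considering artifacts, background consistency, and text legibility. "
        "Reply with the number only."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable MLLM; an assumption here
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
        temperature=0,
    )
    return float(resp.choices[0].message.content.strip())
```

In practice, such a score would typically be averaged over a benchmark's images and paired with task-specific prompts per dataset; the paper's actual evaluation protocol should be consulted for details.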

@article{shu2025_2504.21682,
  title={Visual Text Processing: A Comprehensive Review and Unified Evaluation},
  author={Yan Shu and Weichao Zeng and Fangmin Zhao and Zeyu Chen and Zhenhang Li and Xiaomeng Yang and Yu Zhou and Paolo Rota and Xiang Bai and Lianwen Jin and Xu-Cheng Yin and Nicu Sebe},
  journal={arXiv preprint arXiv:2504.21682},
  year={2025}
}