Person re-identification (Re-ID) is a crucial task in computer vision, aiming to recognize individuals across non-overlapping camera views. While recent advanced vision-language models (VLMs) excel in logical reasoning and multi-task generalization, their application to Re-ID tasks remains limited. They either struggle to perform accurate matching based on identity-relevant features or merely assist image-dominated branches as auxiliary semantics. In this paper, we propose a novel framework, ChatReID, which shifts the focus towards a text-side-dominated retrieval paradigm, enabling flexible and interactive re-identification. To integrate the reasoning abilities of language models into Re-ID pipelines, we first present a large-scale instruction dataset containing more than 8 million prompts to facilitate model fine-tuning. Next, we introduce a hierarchical progressive tuning strategy, which endows the model with Re-ID capability through three stages of tuning: from person attribute understanding, to fine-grained image retrieval, to multi-modal task reasoning. Extensive experiments across ten popular benchmarks demonstrate that ChatReID outperforms existing methods, achieving state-of-the-art performance in all Re-ID tasks. Further experiments demonstrate that ChatReID not only recognizes fine-grained details but also integrates them into a coherent reasoning process.
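The hierarchical progressive tuning strategy can be pictured as three sequential fine-tuning passes, each initialized from the previous stage's weights and driven by increasingly complex instruction data. The sketch below is a minimal illustration of that staging pattern only; the stage names follow the abstract, but the data paths, epoch counts, and helper functions (`Stage`, `run_stage`) are hypothetical and not taken from the paper.

```python
# Minimal sketch of a hierarchical progressive tuning loop.
# All identifiers here are hypothetical illustrations; the paper's
# actual training procedure may differ.

from dataclasses import dataclass


@dataclass
class Stage:
    name: str        # human-readable stage label
    data_path: str   # instruction data for this stage (hypothetical paths)
    epochs: int      # how long to tune on this stage's data


def run_stage(model_state: dict, stage: Stage) -> dict:
    """Placeholder for one fine-tuning pass; a real implementation
    would load stage.data_path and update the VLM's weights."""
    print(f"[{stage.name}] tuning for {stage.epochs} epoch(s) on {stage.data_path}")
    model_state = dict(model_state)  # copy so each stage builds on the last
    model_state[stage.name] = "tuned"
    return model_state


# Stages mirror the abstract: attribute understanding -> fine-grained
# image retrieval -> multi-modal task reasoning.
STAGES = [
    Stage("attribute_understanding", "data/stage1_attributes.jsonl", epochs=1),
    Stage("fine_grained_retrieval", "data/stage2_retrieval.jsonl", epochs=1),
    Stage("multimodal_task_reasoning", "data/stage3_reasoning.jsonl", epochs=1),
]

if __name__ == "__main__":
    state = {"base": "pretrained-vlm"}  # hypothetical base checkpoint
    for stage in STAGES:
        state = run_stage(state, stage)
    print("final state:", state)
```

The essential design choice this sketch captures is that each stage starts from the previous stage's weights rather than from the base checkpoint, so capabilities learned in earlier stages carry forward into the later, harder tasks.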
@article{niu2025_2502.19958,
  title={ChatReID: Open-ended Interactive Person Retrieval via Hierarchical Progressive Tuning for Vision Language Models},
  author={Ke Niu and Haiyang Yu and Mengyang Zhao and Teng Fu and Siyang Yi and Wei Lu and Bin Li and Xuelin Qian and Xiangyang Xue},
  journal={arXiv preprint arXiv:2502.19958},
  year={2025}
}