
The Early Bird Identifies the Worm: You Can't Beat a Head Start in Long-Term Body Re-ID (ECHO-BID)

Main:8 Pages
Bibliography:3 Pages
9 Tables
Abstract

A wide range of model-based approaches to long-term person re-identification have been proposed. Whether these models perform more accurately than direct domain transfer learning applied to extensively trained large-scale foundation models is not known. We applied domain transfer learning for long-term person re-id to four vision foundation models (CLIP, DINOv2, AIMv2, and EVA-02). Domain-adapted versions of all four models surpassed existing state-of-the-art models by a large margin in highly unconstrained viewing environments. Decision score fusion of the four models improved performance over any individual model. Of the individual models, the EVA-02 foundation model provided the best "head start" to long-term re-id, surpassing other models on three of the four performance metrics by substantial margins. Accordingly, we introduce Eva Clothes-Change from Hidden Objects - Body IDentification (ECHO-BID), a class of long-term re-id models built on the object-pretrained EVA-02 Large backbones. Ablation experiments varying backbone size, scale of object classification pretraining, and transfer learning protocol indicated that model size and the use of a smaller but more challenging transfer learning protocol are critical to performance. We conclude that foundation models provide a head start to domain transfer learning and support state-of-the-art performance with modest amounts of domain data. The limited availability of long-term re-id data makes this approach advantageous.
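
The abstract does not state which rule was used for decision score fusion. As a rough illustration only, the sketch below assumes each model yields one similarity score per gallery identity, standardizes each model's scores with a z-score, and averages them. The function names and the fusion rule are hypothetical and are not taken from the paper.

```python
import math
from statistics import mean, pstdev


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zscore(scores):
    """Standardize one model's scores so that models with different
    score ranges contribute comparably to the fused score."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        # All scores identical: this model carries no ranking information.
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


def fuse_scores(per_model_scores):
    """Average z-normalized scores across models (assumed fusion rule).

    per_model_scores: one list per model, each holding one similarity
    score per gallery item, in the same gallery order.
    """
    normalized = [zscore(scores) for scores in per_model_scores]
    return [mean(column) for column in zip(*normalized)]


def rank_gallery(fused):
    """Return gallery indices sorted from best to worst fused score."""
    return sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)


if __name__ == "__main__":
    # Toy scores for one probe against three gallery identities,
    # from two hypothetical models with different score scales.
    model_a = [0.2, 0.9, 0.4]
    model_b = [10.0, 30.0, 35.0]
    fused = fuse_scores([model_a, model_b])
    print(rank_gallery(fused))
```

Normalizing before averaging keeps a model with a wide score range from dominating the fused ranking; other rules (weighted sums, rank-level fusion) would fit the same interface.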
