The Early Bird Identifies the Worm: You Can't Beat a Head Start in Long-Term Body Re-ID (ECHO-BID)
A wide range of model-based approaches to long-term person re-identification have been proposed. Whether these models perform more accurately than direct domain transfer learning applied to extensively trained large-scale foundation models is not known. We applied domain transfer learning for long-term person re-id to four vision foundation models (CLIP, DINOv2, AIMv2, and EVA-02). Domain-adapted versions of all four models %CLIP-L, DINOv2-L, AIMv2-L, and EVA-02-L surpassed existing state-of-the-art models by a large margin in highly unconstrained viewing environments. Decision score fusion of the four models improved performance over any individual model. Of the individual models, the EVA-02 foundation model provided the best ``head start'' to long-term re-id, surpassing other models on three of the four performance metrics by substantial margins. Accordingly, we introduce va lothes-Change from idden bjects - ody entification (ECHO-BID), a class of long-term re-id models built on the object-pretrained EVA-02 Large backbones. Ablation experiments varying backbone size, scale of object classification pretraining, and transfer learning protocol indicated that model size and the use of a smaller, but more challenging transfer learning protocol are critical features in performance. We conclude that foundation models provide a head start to domain transfer learning and support state-of-the-art performance with modest amounts of domain data. The limited availability of long-term re-id data makes this approach advantageous.
View on arXiv