
Fine-Grained Classification: Connecting Metadata via Cross-Contrastive Pre-Training

Main: 4 Pages
4 Figures
Bibliography: 1 Page
Abstract

Fine-grained visual classification aims to recognize objects belonging to many subordinate categories of a supercategory, where appearance alone often fails to distinguish highly similar classes. We propose a unified framework that integrates image, text, and metadata via cross-contrastive pre-training. We first align the three modality encoders in a shared embedding space and then fine-tune the image and metadata encoders for classification. On NABirds, our approach improves over the baseline by 7.83% and achieves 84.44% top-1 accuracy, outperforming strong multimodal methods.
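The alignment step described above can be illustrated with a minimal sketch. The abstract does not state the loss, so this assumes a CLIP-style symmetric InfoNCE objective applied to each of the three modality pairs (image-text, image-metadata, text-metadata) with equal weights. The function names, the temperature of 0.07, and the use of precomputed embeddings in place of real encoders are all hypothetical choices for illustration, not the paper's implementation.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of embeddings.

    Row i of `a` and row i of `b` are treated as a positive pair;
    every other row in the batch acts as a negative.
    """
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature              # (batch, batch) similarities
    idx = np.arange(len(a))
    loss_ab = -log_softmax(logits, axis=1)[idx, idx].mean()  # a -> b direction
    loss_ba = -log_softmax(logits, axis=0)[idx, idx].mean()  # b -> a direction
    return 0.5 * (loss_ab + loss_ba)


def cross_contrastive_loss(image, text, meta, temperature=0.07):
    """Sum of pairwise losses over the three modality pairs.

    Equal weighting of the pairs is an assumption made for this sketch.
    """
    return (symmetric_info_nce(image, text, temperature)
            + symmetric_info_nce(image, meta, temperature)
            + symmetric_info_nce(text, meta, temperature))


# Toy usage: random vectors stand in for encoder outputs.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(8, 16))
text_emb = image_emb + 0.1 * rng.normal(size=(8, 16))
meta_emb = image_emb + 0.1 * rng.normal(size=(8, 16))
loss = cross_contrastive_loss(image_emb, text_emb, meta_emb)
```

Minimizing a loss of this shape pulls the three embeddings of the same sample together and pushes apart embeddings of different samples, which is one common way to obtain a shared embedding space before fine-tuning a subset of the encoders for classification.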
