Align Yourself: Self-supervised Pre-training for Fine-grained Recognition via Saliency Alignment

British Machine Vision Conference (BMVC), 2021
Siyuan Li
Kai Wang
Lei Shang
Baigui Sun
Hao Li
Stan Z. Li
Abstract

Self-supervised contrastive learning has demonstrated great potential in learning visual representations. Despite its success on various downstream tasks such as image classification and object detection, self-supervised pre-training for fine-grained scenarios has not been fully explored. In this paper, we first point out that current contrastive methods are prone to memorizing background/foreground texture and are therefore limited in their ability to localize the foreground object. Our analysis suggests that learning to extract discriminative texture information and learning to localize are equally crucial for self-supervised pre-training in fine-grained scenarios. Based on these findings, we introduce cross-view saliency alignment (CVSA), a contrastive learning framework that first crops and swaps the saliency regions of images to generate novel views and then guides the model to localize the foreground object via a cross-view alignment loss. Extensive experiments on four popular fine-grained classification benchmarks show that CVSA significantly improves the learned representation.
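
The crop-and-swap view generation described in the abstract can be illustrated with a minimal sketch. This is one possible reading, not the authors' implementation: the helper names (`saliency_bbox`, `resize_nearest`, `swap_saliency`), the 0.5 threshold, the bounding-box cropping, and the nearest-neighbour resize are all assumptions made for illustration, and the saliency masks are assumed to come from some external saliency detector. The cross-view alignment loss is not sketched, since the abstract does not specify its form.

```python
import numpy as np

def saliency_bbox(mask, threshold=0.5):
    """Bounding box (top, bottom, left, right) of pixels with mask > threshold.

    The threshold value is an illustrative assumption.
    """
    ys, xs = np.where(mask > threshold)
    if ys.size == 0:
        # No salient pixels found: fall back to the whole image.
        return 0, mask.shape[0], 0, mask.shape[1]
    return int(ys.min()), int(ys.max()) + 1, int(xs.min()), int(xs.max()) + 1

def resize_nearest(patch, out_h, out_w):
    """Nearest-neighbour resize of an (H, W, C) patch using index arithmetic."""
    rows = np.arange(out_h) * patch.shape[0] // out_h
    cols = np.arange(out_w) * patch.shape[1] // out_w
    return patch[rows][:, cols]

def swap_saliency(img_a, mask_a, img_b, mask_b):
    """Paste the salient crop of img_a over the salient box of img_b.

    Returns the new view and the box where the foreground was placed,
    which a localization-style loss could use as a target.
    """
    t_a, b_a, l_a, r_a = saliency_bbox(mask_a)
    t_b, b_b, l_b, r_b = saliency_bbox(mask_b)
    crop = img_a[t_a:b_a, l_a:r_a]
    view = img_b.copy()
    view[t_b:b_b, l_b:r_b] = resize_nearest(crop, b_b - t_b, r_b - l_b)
    return view, (t_b, b_b, l_b, r_b)

# Toy example: image A has a foreground of 9s, image B is an all-zero background.
img_a = np.ones((8, 8, 3), dtype=np.uint8)
img_a[2:6, 2:6] = 9
mask_a = np.zeros((8, 8))
mask_a[2:6, 2:6] = 1.0

img_b = np.zeros((8, 8, 3), dtype=np.uint8)
mask_b = np.zeros((8, 8))
mask_b[1:4, 3:8] = 1.0

view, box = swap_saliency(img_a, mask_a, img_b, mask_b)
```

In this toy case the foreground of image A ends up on the background of image B, so the foreground texture is preserved while the surrounding context changes.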