Align Yourself: Self-supervised Pre-training for Fine-grained Recognition via Saliency Alignment

British Machine Vision Conference (BMVC), 2021

30 June 2021

Di Wu

Siyuan Li

Z. Zang

Kai Wang

Lei Shang

Baigui Sun

Hao Li

Stan Z. Li

SSL

Abstract

Self-supervised contrastive learning has demonstrated great potential in learning visual representations. Despite its success on various downstream tasks such as image classification and object detection, self-supervised pre-training for fine-grained scenarios has not been fully explored. In this paper, we first point out that current contrastive methods are prone to memorizing background/foreground texture and are therefore limited in their ability to localize the foreground object. Our analysis suggests that learning to extract discriminative texture information and learning to localize are equally crucial for self-supervised pre-training in fine-grained scenarios. Based on these findings, we introduce cross-view saliency alignment (CVSA), a contrastive learning framework that first crops and swaps the saliency regions of images as a novel form of view generation and then guides the model to localize the foreground object via a cross-view alignment loss. Extensive experiments on four popular fine-grained classification benchmarks show that CVSA significantly improves the learned representation.
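
The crop-and-swap view generation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`saliency_bbox`, `crop_and_swap`), the relative threshold of 0.5, the random paste location, and the absence of any resizing are all assumptions made for illustration. The abstract does not specify how saliency maps are obtained or how the alignment loss is computed, so the sketch takes a saliency map as given and covers only the view-generation step.

```python
import numpy as np

def saliency_bbox(saliency, thresh=0.5):
    """Bounding box (top, left, bottom, right) of pixels whose saliency
    is at least thresh * max. The threshold rule is an assumption."""
    mask = saliency >= thresh * saliency.max()
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1

def crop_and_swap(fg_image, fg_saliency, bg_image, rng):
    """Crop the salient box from fg_image and paste it at a random
    position in a copy of bg_image (hypothetical helper).

    Returns the new view and the box where the foreground now sits,
    which a localization-style loss could use as a target."""
    top, left, bottom, right = saliency_bbox(fg_saliency)
    bg_h, bg_w = bg_image.shape[:2]
    # Trim the patch if it is larger than the background image.
    patch = fg_image[top:bottom, left:right][:bg_h, :bg_w]
    h, w = patch.shape[:2]
    y = int(rng.integers(0, bg_h - h + 1))
    x = int(rng.integers(0, bg_w - w + 1))
    view = bg_image.copy()
    view[y:y + h, x:x + w] = patch
    return view, (y, x, y + h, x + w)

# Toy usage: a white "foreground" image, a black background image,
# and a synthetic saliency map with one rectangular salient region.
rng = np.random.default_rng(0)
fg = np.ones((8, 8, 3))
bg = np.zeros((8, 8, 3))
sal = np.zeros((8, 8))
sal[2:5, 3:7] = 1.0
view, box = crop_and_swap(fg, sal, bg, rng)
```

Returning the paste box alongside the view is a design choice of this sketch: because the foreground's position in the generated view is known, it is available as a supervisory signal without any manual annotation.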
