Fine-Grained Recognition with Automatic and Efficient Part Attention
Fine-grained recognition is challenging because local inter-class differences are subtle while intra-class variations, such as pose, are large. A key to addressing this problem is localizing discriminative parts from which pose-invariant features can be extracted. However, ground-truth part annotations are expensive to acquire, and for many fine-grained classes it is hard even to define parts. This work introduces Fully Convolutional Attention Networks (FCANs), a reinforcement learning framework that optimally glimpses local discriminative regions and adapts to different fine-grained domains. Compared to previous methods, our approach has four advantages: 1) its three components (feature extraction, visual attention, and fine-grained classification) are unified in an end-to-end system; 2) the weakly supervised reinforcement learning procedure requires no expensive part annotations; 3) the fully convolutional architecture speeds up both training and testing; 4) the greedy reward strategy accelerates the convergence of learning. We demonstrate the effectiveness of our method with extensive experiments on four challenging fine-grained benchmark datasets: Stanford Dogs, Stanford Cars, CUB-200-2011, and Food-101.
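To make the idea of reinforcement-learned attention more concrete, the following is a minimal NumPy sketch of one policy-gradient step over a spatial attention map. It is an illustration under stated assumptions, not the FCAN implementation: the abstract does not specify the network, the sampling scheme, or the reward, so the function names (`spatial_softmax`, `sample_glimpse`, `reinforce_grad`, `greedy_reward`), the 7x7 map size, and the 0/1 per-step reward are all hypothetical.

```python
import numpy as np

def spatial_softmax(score_map):
    """Turn an (H, W) score map into a probability map over locations."""
    flat = score_map.ravel()
    flat = flat - flat.max()  # subtract the max for numerical stability
    probs = np.exp(flat)
    probs /= probs.sum()
    return probs.reshape(score_map.shape)

def sample_glimpse(score_map, rng):
    """Sample one (row, col) glimpse location from the attention distribution."""
    probs = spatial_softmax(score_map)
    idx = rng.choice(probs.size, p=probs.ravel())
    return np.unravel_index(idx, probs.shape), probs

def reinforce_grad(probs, location, reward):
    """REINFORCE gradient of reward * log p(location) w.r.t. the score map."""
    one_hot = np.zeros_like(probs)
    one_hot[location] = 1.0
    # d/ds log softmax(s)[location] = one_hot - probs
    return reward * (one_hot - probs)

def greedy_reward(predicted_label, true_label):
    """Assumed per-step reward: 1 if the current prediction is correct, else 0."""
    return 1.0 if predicted_label == true_label else 0.0

# Usage: a random map stands in for the output of a convolutional attention head.
rng = np.random.default_rng(0)
score_map = rng.normal(size=(7, 7))
location, probs = sample_glimpse(score_map, rng)
grad = reinforce_grad(probs, location, greedy_reward(predicted_label=3, true_label=3))
```

Because the score map would come from a convolution over shared feature maps, attention scores for all locations are available in one forward pass. A reward assigned at every glimpse step, rather than only at the end of an episode, is one plausible reading of the "greedy reward strategy" named in the abstract.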