MAFormer: A Transformer Network with Multi-scale Attention Fusion for Visual Recognition

31 August 2022

Errui Ding

Abstract

Vision Transformer and its variants have demonstrated great potential in various computer vision tasks. However, conventional vision transformers often focus on global dependency at a coarse level and therefore struggle to learn both global relationships and fine-grained representations at the token level. In this paper, we introduce Multi-scale Attention Fusion into the transformer (MAFormer), which explores local aggregation and global feature extraction in a dual-stream framework for visual recognition. We develop a simple but effective module to explore the full potential of transformers for visual representation by learning fine-grained and coarse-grained features at the token level and dynamically fusing them. Our Multi-scale Attention Fusion (MAF) block consists of: i) a local window attention branch that learns short-range interactions within windows, aggregating fine-grained local features; ii) a global feature extraction branch that uses a novel Global Learning with Down-sampling (GLD) operation to efficiently capture long-range context information within the whole image; iii) a fusion module that self-explores the integration of both features via attention. MAFormer achieves state-of-the-art performance on common vision tasks. In particular, MAFormer-L achieves 85.9% Top-1 accuracy on ImageNet, surpassing CSWin-B and LV-ViT-L by 1.7% and 0.6%, respectively. On MSCOCO, MAFormer outperforms the prior art CSWin by 1.7% mAP on object detection and 1.4% on instance segmentation with a similar number of parameters, demonstrating its potential as a general backbone network.
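The three components of the MAF block listed in the abstract (local window attention, a down-sampled global branch, and attention-based fusion) can be illustrated with a small dependency-free sketch. This is not the authors' implementation: it works on a 1D token sequence instead of a 2D feature map, uses no learned projections, substitutes plain average pooling for the GLD operation, and adds a residual connection in the fusion step. All function names (`local_window_attention`, `downsample`, `global_branch`, `maf_block`) and the `window` and `stride` parameters are hypothetical, chosen only for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(dim)])
    return out


def local_window_attention(tokens, window):
    """Branch (i): self-attention restricted to non-overlapping windows."""
    out = []
    for start in range(0, len(tokens), window):
        chunk = tokens[start:start + window]
        out.extend(attention(chunk, chunk, chunk))
    return out


def downsample(tokens, stride):
    """Average-pool groups of `stride` tokens (assumed stand-in for GLD)."""
    pooled = []
    for start in range(0, len(tokens), stride):
        group = tokens[start:start + stride]
        pooled.append([sum(col) / len(group) for col in zip(*group)])
    return pooled


def global_branch(tokens, stride):
    """Branch (ii): self-attention among coarse tokens spanning the input."""
    coarse = downsample(tokens, stride)
    return attention(coarse, coarse, coarse)


def maf_block(tokens, window=2, stride=2):
    """Fuse fine and coarse features via attention (iii)."""
    local = local_window_attention(tokens, window)
    coarse = global_branch(tokens, stride)
    # Fine-grained tokens act as queries over the coarse global context.
    fused = attention(local, coarse, coarse)
    # Residual sum of local and fused features (an assumption of this sketch).
    return [[a + b for a, b in zip(l, f)] for l, f in zip(local, fused)]


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
    for row in maf_block(tokens):
        print([round(x, 3) for x in row])
```

In this sketch the local branch costs grow with the window size rather than the sequence length, while the global branch attends over only `len(tokens) / stride` coarse tokens, which mirrors the efficiency argument the abstract makes for down-sampling before global attention.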
