

Lawin Transformer: Improving New-Era Vision Backbones with Multi-Scale Representations for Semantic Segmentation

5 January 2022
Haotian Yan, Chuang Zhang, Ming Wu
Topics: ViT
Links: arXiv:2201.01615 (abs) · PDF · HTML · GitHub (125★)
Abstract

The multi-level aggregation (MLA) module has emerged as a critical component for advancing new-era vision backbones in semantic segmentation. In this paper, we propose Lawin (large window) Transformer, a novel MLA architecture that creatively utilizes multi-scale feature maps from the vision backbone. At the core of Lawin Transformer is Lawin attention, a newly designed window attention mechanism capable of querying much larger context windows than local windows. We focus on an efficient and simple application of the large-window paradigm, allowing flexible regulation of the ratio of large context to query and capturing multi-scale representations. We validate the effectiveness of Lawin Transformer on Cityscapes and ADE20K, consistently demonstrating clear superiority over widely used MLA modules when combined with new-era vision backbones. The code is available at https://github.com/yan-hao-tian/lawin.
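
The abstract's central idea, a small query window attending to a much larger context window whose size is controlled by a ratio, can be sketched as follows. This is a minimal, illustrative PyTorch reading of that description, not the authors' implementation (see the linked repository for that): the module name LargeWindowAttention, the choice of average pooling to shrink the R-times-larger context back to the query-window size, and all default hyperparameters are assumptions made for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LargeWindowAttention(nn.Module):
    # Hypothetical sketch of large-window ("Lawin"-style) attention:
    # each P x P query window attends to an R*P x R*P context window
    # centred on it, with the context average-pooled back to P x P so
    # the attention cost stays close to ordinary window attention.
    # Module name, pooling choice, and defaults are illustrative.
    def __init__(self, dim, num_heads=4, window=8, ratio=2):
        super().__init__()
        self.window = window  # P: query window size
        self.ratio = ratio    # R: ratio of large context to query
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):  # x: (B, C, H, W); H, W divisible by window
        B, C, H, W = x.shape
        P, R = self.window, self.ratio

        # Queries: non-overlapping P x P windows -> (B*nW, P*P, C)
        q = F.unfold(x, kernel_size=P, stride=P)  # (B, C*P*P, nW)
        nW = q.shape[-1]
        q = q.transpose(1, 2).reshape(B * nW, C, P * P).transpose(1, 2)

        # Context: an R*P x R*P window centred on each query window,
        # extracted with zero padding, then pooled down to P x P tokens.
        pad = (R * P - P) // 2
        ctx = F.unfold(x, kernel_size=R * P, stride=P, padding=pad)
        ctx = ctx.transpose(1, 2).reshape(B * nW, C, R * P, R * P)
        ctx = F.avg_pool2d(ctx, kernel_size=R)  # (B*nW, C, P, P)
        kv = ctx.flatten(2).transpose(1, 2)     # (B*nW, P*P, C)

        # Small-window queries attend to the pooled large-window context.
        out, _ = self.attn(q, kv, kv)
        out = out.transpose(1, 2).reshape(B, nW, C * P * P).transpose(1, 2)
        return F.fold(out, output_size=(H, W), kernel_size=P, stride=P)

# Shape check: a (1, 64, 32, 32) map goes in and comes out unchanged in shape.
y = LargeWindowAttention(dim=64)(torch.randn(1, 64, 32, 32))

Pooling the large window down to the query-window token count is one way to keep cost flat while R grows; the paper's "flexible regulation of the ratio of large context to query" corresponds here to varying the ratio argument.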
