We propose iSegFormer, a memory-efficient transformer that combines a Swin transformer with a lightweight multilayer perceptron (MLP) decoder. With efficient Swin transformer blocks for hierarchical self-attention and a simple MLP decoder for aggregating both local and global attention, iSegFormer learns powerful representations while achieving high computational efficiency. Specifically, we apply iSegFormer to interactive 3D medical image segmentation.
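To make the encoder–decoder pairing concrete, the following is a minimal sketch, not the authors' released code: the Swin transformer stages are replaced by a simplified hierarchical stand-in (no windowed self-attention), and all names (StageBlock, MLPDecoder, SegFormerLikeNet) and hyperparameters are illustrative assumptions. It only shows how multi-scale features from a hierarchical encoder can be aggregated by an all-MLP decoder into per-pixel predictions for a 2D slice.

```python
# Minimal sketch (not the authors' code): a hierarchical encoder standing in for
# Swin transformer stages, plus a lightweight MLP decoder that fuses multi-scale
# features. All module names and sizes here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class StageBlock(nn.Module):
    """Stand-in for one Swin stage: patch merging (strided conv) + residual token MLP."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.down = nn.Conv2d(in_ch, out_ch, kernel_size=2, stride=2)  # halve resolution
        self.norm = nn.LayerNorm(out_ch)
        self.mlp = nn.Sequential(nn.Linear(out_ch, 4 * out_ch), nn.GELU(),
                                 nn.Linear(4 * out_ch, out_ch))

    def forward(self, x):
        x = self.down(x)                                  # B, C, H, W
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)                  # B, HW, C (token view)
        t = t + self.mlp(self.norm(t))                    # residual token mixing
        return t.transpose(1, 2).reshape(b, c, h, w)


class MLPDecoder(nn.Module):
    """Projects each stage to a common width, upsamples, fuses, and predicts logits."""
    def __init__(self, in_chs, embed_dim, num_classes):
        super().__init__()
        self.proj = nn.ModuleList([nn.Conv2d(c, embed_dim, 1) for c in in_chs])
        self.fuse = nn.Conv2d(embed_dim * len(in_chs), embed_dim, 1)
        self.head = nn.Conv2d(embed_dim, num_classes, 1)

    def forward(self, feats):
        size = feats[0].shape[-2:]                        # highest-resolution stage
        ups = [F.interpolate(p(f), size=size, mode="bilinear", align_corners=False)
               for p, f in zip(self.proj, feats)]
        return self.head(self.fuse(torch.cat(ups, dim=1)))


class SegFormerLikeNet(nn.Module):
    """Hypothetical end-to-end model: patch embedding, 4 feature scales, MLP decoder."""
    def __init__(self, num_classes=2):
        super().__init__()
        chs = [32, 64, 128, 256]
        self.stem = nn.Conv2d(1, chs[0], kernel_size=4, stride=4)  # patch embedding
        self.stages = nn.ModuleList([StageBlock(chs[i], chs[i + 1]) for i in range(3)])
        self.decoder = MLPDecoder(chs, embed_dim=128, num_classes=num_classes)

    def forward(self, x):
        f = self.stem(x)
        feats = [f]
        for stage in self.stages:
            f = stage(f)
            feats.append(f)
        logits = self.decoder(feats)
        return F.interpolate(logits, size=x.shape[-2:], mode="bilinear", align_corners=False)


if __name__ == "__main__":
    model = SegFormerLikeNet(num_classes=2)
    slice_2d = torch.randn(1, 1, 256, 256)                # one grayscale 2D slice
    print(model(slice_2d).shape)                          # torch.Size([1, 2, 256, 256])
```

In the interactive 3D setting, such a 2D model would typically be applied slice by slice (or extended with 3D operations), with user clicks encoded as extra input channels; those details are beyond this sketch.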