
Efficient Self-supervised Vision Pretraining with Local Masked Reconstruction

Abstract

Self-supervised learning for computer vision has achieved tremendous progress and improved many downstream vision tasks such as image classification, semantic segmentation, and object detection. Among these, generative self-supervised vision learning approaches such as MAE and BEiT show promising performance. However, their global masked reconstruction mechanism is computationally demanding. To address this issue, we propose local masked reconstruction (LoMaR), a simple yet effective approach that performs masked reconstruction within a small window of 7×7 patches on a simple Transformer encoder, improving the trade-off between efficiency and accuracy compared to global masked reconstruction over the entire image. Extensive experiments show that LoMaR reaches 84.1% top-1 accuracy on ImageNet-1K classification, outperforming MAE by 0.5%. After finetuning the pretrained LoMaR on 384×384 images, it reaches 85.4% top-1 accuracy, surpassing MAE by 0.6%. On MS COCO, LoMaR outperforms MAE by 0.5 AP^box on object detection and 0.5 AP^mask on instance segmentation. LoMaR is especially computation-efficient when pretraining on high-resolution images, e.g., it is 3.1× faster than MAE with 0.2% higher classification accuracy when pretraining on 448×448 images. This local masked reconstruction learning mechanism can be easily integrated into any other generative self-supervised learning approach. Our code is publicly available at this https URL.
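The core idea of restricting masked reconstruction to a small local window can be illustrated with a minimal sketch. The helper below is hypothetical (not the authors' implementation): it assumes a 14×14 patch grid (a 224×224 image with 16×16 patches) and shows how one 7×7 window of patch indices might be sampled and randomly partitioned into visible and masked patches, in contrast to MAE-style masking over the entire grid.

```python
import numpy as np

def sample_local_window(grid_h=14, grid_w=14, win=7, mask_ratio=0.75, rng=None):
    """Sample one win×win window of patch indices and a random mask inside it.

    Illustrative sketch of local masked reconstruction: instead of masking
    and reconstructing over all grid_h * grid_w patches (global scheme),
    the mask and reconstruction targets are confined to a small window.
    """
    rng = rng or np.random.default_rng()
    # Random top-left corner of the window inside the patch grid.
    top = rng.integers(0, grid_h - win + 1)
    left = rng.integers(0, grid_w - win + 1)
    rows, cols = np.meshgrid(np.arange(top, top + win),
                             np.arange(left, left + win), indexing="ij")
    window_idx = (rows * grid_w + cols).ravel()  # 49 flat patch indices
    # Mask a fraction of the window's patches; the rest stay visible.
    n_mask = int(round(mask_ratio * window_idx.size))
    masked = rng.choice(window_idx, size=n_mask, replace=False)
    visible = np.setdiff1d(window_idx, masked)
    return window_idx, visible, masked
```

Only the 49 patches of the sampled window (rather than all 196) would then be fed to the encoder, which is where the efficiency gain over global reconstruction comes from; several such windows can be sampled per image.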

@article{chen2025_2206.00790,
  title={Efficient Self-supervised Vision Pretraining with Local Masked Reconstruction},
  author={Jun Chen and Ming Hu and Boyang Li and Mohamed Elhoseiny},
  journal={arXiv preprint arXiv:2206.00790},
  year={2025}
}