
Distilling Optimal Neural Networks: Rapid Search in Diverse Spaces

IEEE/CVF International Conference on Computer Vision (ICCV), 2021
Abstract

Today, state-of-the-art Neural Architecture Search (NAS) methods cannot scale to many hardware platforms or scenarios at low training cost, and/or can only handle non-diverse, heavily constrained architectural search spaces. To solve these issues, we present DONNA (Distilling Optimal Neural Network Architectures), a novel pipeline for rapid and diverse NAS that scales to many user scenarios. In DONNA, a search consists of three phases. First, an accuracy predictor is built using blockwise knowledge distillation. This predictor enables searching across diverse networks with varying macro-architectural parameters such as layer types and attention mechanisms, as well as across micro-architectural parameters such as block repeats and expansion rates. Second, a rapid evolutionary search phase finds a set of Pareto-optimal architectures for any scenario, using the accuracy predictor together with on-device measurements. Third, the optimal models are quickly finetuned to training-from-scratch accuracy. With this approach, DONNA finds state-of-the-art architectures on-device up to 100x faster than MnasNet. Classifying ImageNet, DONNA architectures are 20% faster than EfficientNet-B0 and MobileNetV2 on an Nvidia V100 GPU, and 10% faster with 0.5% higher accuracy than MobileNetV2-1.4x on a Samsung S20 smartphone. Beyond NAS, DONNA is also used for search-space extension and exploration, as well as hardware-aware model compression.
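To make the second phase concrete, the sketch below shows a minimal Pareto-style evolutionary search driven by an accuracy predictor and a latency callback, as the abstract describes. It is only an illustration under assumed names: the architecture encoding, the `predicted_accuracy` and `measured_latency` stand-ins, and the simple mutation scheme are hypothetical placeholders, not the authors' implementation (which fits the real predictor via blockwise knowledge distillation and uses true on-device measurements).

```python
"""Illustrative sketch of an accuracy-predictor-driven evolutionary search.
All names, the search space, and the scoring functions are assumptions made
for this example; they are not DONNA's actual code."""
import random

# Hypothetical diverse search space: per-block depth, expansion rate, attention.
CHOICES = {"depth": [1, 2, 3, 4], "expand": [2, 3, 4, 6], "attention": [0, 1]}
NUM_BLOCKS = 5


def sample_arch():
    """Sample one architecture as a list of per-block choices."""
    return [{k: random.choice(v) for k, v in CHOICES.items()}
            for _ in range(NUM_BLOCKS)]


def mutate(arch, p=0.2):
    """Randomly resample each block parameter with probability p."""
    child = [dict(block) for block in arch]
    for block in child:
        for key, options in CHOICES.items():
            if random.random() < p:
                block[key] = random.choice(options)
    return child


# Stand-ins for the phase-1 accuracy predictor and the on-device latency
# measurement; in practice these would be a fitted regressor and a real device.
def predicted_accuracy(arch):
    return sum(b["depth"] + b["expand"] + 2 * b["attention"] for b in arch)


def measured_latency(arch):
    return sum(b["depth"] * b["expand"] * (1.5 if b["attention"] else 1.0)
               for b in arch)


def pareto_front(population):
    """Keep architectures not dominated in (accuracy up, latency down)."""
    scored = [(predicted_accuracy(a), measured_latency(a), a) for a in population]
    front = []
    for acc, lat, arch in scored:
        dominated = any(a2 >= acc and l2 <= lat and (a2 > acc or l2 < lat)
                        for a2, l2, _ in scored)
        if not dominated:
            front.append((acc, lat, arch))
    return front


def evolve(generations=20, pop_size=32):
    """Evolve a population and return its final Pareto front."""
    population = [sample_arch() for _ in range(pop_size)]
    for _ in range(generations):
        parents = [arch for _, _, arch in pareto_front(population)]
        children = [mutate(random.choice(parents))
                    for _ in range(max(0, pop_size - len(parents)))]
        population = parents + children
    return pareto_front(population)


if __name__ == "__main__":
    for acc, lat, _ in sorted(evolve()):
        print(f"predicted_acc={acc:.1f}  latency={lat:.1f}")
```

Because accuracy comes from a predictor and only latency requires device access, each additional hardware target or deployment scenario reuses the same predictor, which is what lets this kind of search scale to many scenarios cheaply.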
