Progressive Ensemble Distillation: Building Ensembles for Efficient Inference

Abstract

We study the problem of progressive ensemble distillation: given a large, pretrained teacher model g, we seek to decompose the model into smaller, low-inference-cost student models f_i, such that progressively evaluating additional models in this ensemble leads to improved predictions. The resulting ensemble allows for flexibly tuning accuracy vs. inference cost at runtime, which is useful for a number of applications in on-device inference. The method we propose, B-DISTIL, relies on an algorithmic procedure that uses function composition over intermediate activations to construct expressive ensembles with similar performance as g, but with smaller student models. We demonstrate the effectiveness of B-DISTIL by decomposing pretrained models across standard image, speech, and sensor datasets. We also provide theoretical guarantees in terms of convergence and generalization.
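To make the accuracy vs. inference-cost trade-off concrete, here is a minimal sketch (not the authors' code) of progressive ensemble inference: student models are evaluated one at a time and their logits accumulated, so the prediction improves as the budget allows more students. The function name, the budget parameter, and the simple logit-averaging rule are assumptions for illustration; B-DISTIL itself additionally composes functions over intermediate activations, which is omitted here.

```python
import torch
import torch.nn as nn


def progressive_predict(students, x, budget=None):
    """Evaluate students in order, yielding the running ensemble prediction.

    students : list of nn.Module, ordered from cheapest to most expensive
    x        : input batch
    budget   : optional cap on how many students to evaluate
    """
    n = len(students) if budget is None else min(budget, len(students))
    logits_sum = None
    for i in range(n):
        with torch.no_grad():
            out = students[i](x)
        logits_sum = out if logits_sum is None else logits_sum + out
        # The running average is the ensemble prediction after i+1 students.
        yield logits_sum / (i + 1)


# Toy usage with small linear students standing in for distilled models.
students = [nn.Linear(16, 10) for _ in range(3)]
x = torch.randn(4, 16)
for step, pred in enumerate(progressive_predict(students, x, budget=2), 1):
    print(f"after {step} student(s):", pred.argmax(dim=-1).tolist())
```

At runtime, evaluation can simply stop once the device's latency or energy budget is exhausted, and the most recent running prediction is used.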
