We study the problem of progressive distillation: given a large, pre-trained teacher model $g$, we seek to decompose the model into an ensemble of smaller, low-inference-cost student models $f_i$. The resulting ensemble allows accuracy to be flexibly traded off against inference cost, which is useful for a number of on-device inference applications. The method we propose, B-DISTIL, relies on an algorithmic procedure that uses function composition over intermediate activations to construct expressive ensembles with performance similar to that of $g$, but with much smaller student models. We demonstrate the effectiveness of \algA by decomposing pretrained models across standard image, speech, and sensor datasets. We also provide theoretical guarantees for our method in terms of convergence and generalization.
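As a rough illustration of the accuracy versus inference-cost trade-off described above (the notation $F_k$ and $\mathrm{cost}(\cdot)$ is ours, used only for exposition and not taken from the paper), evaluating the first $k$ students of the ensemble can be viewed as an additive approximation of the teacher,
\[
F_k(x) \;=\; \sum_{i=1}^{k} f_i(x) \;\approx\; g(x),
\qquad
\mathrm{cost}(F_k) \;=\; \sum_{i=1}^{k} \mathrm{cost}(f_i) \;\ll\; \mathrm{cost}(g),
\]
so that evaluating more students improves accuracy at the price of additional inference cost, and evaluation can stop early once the available compute budget is exhausted.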