We present a systematic study of domain generalization (DG) for tiny neural networks. This problem is critical to on-device machine learning applications but has been overlooked in the literature, where research has focused largely on large models. Tiny neural networks have far fewer parameters and much lower complexity and therefore should not be trained in the same way as their large counterparts for DG applications. Through extensive experiments, we find that knowledge distillation (KD), a well-known technique for model compression, tackles the on-device DG problem far more effectively than conventional DG methods. Another interesting observation is that the teacher-student gap on out-of-distribution data is larger than that on in-distribution data, which highlights the capacity mismatch issue as well as a shortcoming of KD. We further propose a method called out-of-distribution knowledge distillation (OKD), whose idea is to teach the student how the teacher handles out-of-distribution data synthesized via disruptive data augmentation. Without adding any extra parameters to the model (thus keeping the deployment cost unchanged), OKD significantly improves DG performance for tiny neural networks across a variety of on-device DG scenarios for image and speech applications. We also contribute a scalable approach for synthesizing visual domain shifts, along with a new suite of DG datasets to complement existing testbeds.
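To make the OKD idea concrete, below is a minimal training-step sketch, assuming a standard PyTorch setup: the student is supervised on clean data while a distillation term matches its predictions to the teacher's on augmented, pseudo-out-of-distribution views. The augmentation choice, model handles, temperature, and loss weighting here are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def okd_step(student, teacher, x, y, augment, T=4.0, alpha=0.5):
    """One OKD-style step: cross-entropy on clean data plus a KD term that
    matches the student to the teacher on synthesized OOD (augmented) data.
    `augment` stands in for the paper's disruptive data augmentation."""
    teacher.eval()
    x_ood = augment(x)                      # synthesize a pseudo-OOD view
    with torch.no_grad():
        t_logits = teacher(x_ood)           # teacher's response to the OOD view
    s_clean = student(x)
    s_ood = student(x_ood)
    ce = F.cross_entropy(s_clean, y)        # supervised loss on in-distribution data
    kd = F.kl_div(F.log_softmax(s_ood / T, dim=1),
                  F.softmax(t_logits / T, dim=1),
                  reduction="batchmean") * T * T
    return (1 - alpha) * ce + alpha * kd
```

Because the extra supervision comes only from the teacher's outputs on augmented inputs, the student's architecture and parameter count are untouched, which is what keeps the deployment cost unchanged.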