We present a systematic study of domain generalization (DG) for tiny neural networks, a problem that is critical to on-device machine learning applications but has been overlooked in the literature, where research has focused only on large models. Tiny neural networks have far fewer parameters and lower complexity, and therefore should not be trained in the same way as their large counterparts for DG applications. We find that knowledge distillation is a strong candidate for solving the problem: it outperforms, by a large margin, state-of-the-art DG methods that were developed for large models. Moreover, we observe that the teacher-student performance gap on test data with domain shift is larger than that on in-distribution data. To improve DG for tiny neural networks without increasing the deployment cost, we propose a simple idea called out-of-distribution knowledge distillation (OKD), which teaches the student how the teacher handles (synthetic) out-of-distribution data and proves to be a promising framework for solving the problem. We also contribute a scalable method for creating DG datasets, called DOmain Shift in COntext (DOSCO), which can be applied to broad data at scale without much human effort. Code and models are released at \url{https://github.com/KaiyangZhou/on-device-dg}.
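To make the OKD idea concrete, the following is a minimal PyTorch sketch of one training step, assuming a frozen teacher, a tiny student, and a placeholder \texttt{synthesize\_ood} augmentation that produces synthetic out-of-distribution views of the batch; the loss combines standard cross-entropy on clean data with a temperature-scaled KL distillation term on the OOD views, following common knowledge-distillation practice rather than the exact objective in the released code.

\begin{verbatim}
# Hedged sketch of out-of-distribution knowledge distillation (OKD).
# `teacher`, `student`, `synthesize_ood`, `tau`, and `lam` are
# illustrative placeholders, not the paper's exact interface.
import torch
import torch.nn.functional as F

def okd_step(student, teacher, x, y, synthesize_ood, tau=4.0, lam=1.0):
    """One training step: supervised loss on in-distribution data plus
    a distillation loss matching the student to the teacher on
    synthetic out-of-distribution views of the same batch."""
    teacher.eval()

    # Standard supervised objective on in-distribution data.
    ce = F.cross_entropy(student(x), y)

    # Create synthetic OOD inputs (e.g., heavy augmentation / style shift).
    x_ood = synthesize_ood(x)

    with torch.no_grad():
        t_logits = teacher(x_ood)
    s_logits = student(x_ood)

    # Match softened student predictions to the teacher's on OOD data.
    kd = F.kl_div(
        F.log_softmax(s_logits / tau, dim=1),
        F.softmax(t_logits / tau, dim=1),
        reduction="batchmean",
    ) * tau * tau

    return ce + lam * kd
\end{verbatim}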