In this paper, we connect language model adaptation with concepts of machine learning theory. We consider a training setup with a large out-of-domain set and a small in-domain set. As a first contribution, we derive how the benefit of training a model on either set depends on the size of the sets and the distance between their underlying distributions. As a second contribution, we show how the most popular data selection techniques -- importance sampling, intelligent data selection, and influence functions -- can be cast in a common framework which highlights their similarities as well as their subtle differences.
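The trade-off behind the first contribution can be illustrated with a classical domain adaptation bound in the spirit of Ben-David et al. (2010); this is only an illustrative sketch, not necessarily the derivation used in the paper, and the symbols below are our own labels.

```latex
% Illustrative only: a standard bound in the style of Ben-David et al. (2010).
% \epsilon_{in}(h): in-domain error of hypothesis h
% \hat{\epsilon}_{out}(h): empirical error on the out-of-domain set of size n
% d(D_{in}, D_{out}): a divergence between the two underlying distributions
% \lambda: error of the best joint hypothesis; C: complexity/confidence term
\epsilon_{in}(h) \;\le\; \hat{\epsilon}_{out}(h)
  \;+\; d(D_{in}, D_{out})
  \;+\; \lambda
  \;+\; C\sqrt{\tfrac{1}{n}}
```

Under such a bound, training on the large out-of-domain set shrinks the $C\sqrt{1/n}$ term but pays the divergence $d(D_{in}, D_{out})$, while training on the small in-domain set removes the divergence term but suffers from the small sample size: precisely the dependence on set sizes and distribution distance stated above.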
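One way to make the second contribution concrete: both importance sampling and Moore-Lewis style intelligent selection can score an out-of-domain example x by the same log-ratio s(x) = log p_in(x) - log p_out(x), then either reweight by exp(s(x)) or keep the top-scoring examples. The sketch below assumes access to two language models via hypothetical callables `log_prob_in` and `log_prob_out`; it illustrates the shared scoring step rather than the paper's exact formulation.

```python
# Illustrative sketch of a common framework for data selection.
# `log_prob_in` / `log_prob_out` are hypothetical callables returning the
# log-likelihood of a text under an in-domain / out-of-domain language model.
import math
from typing import Callable, Iterable, List

def scores(texts: Iterable[str],
           log_prob_in: Callable[[str], float],
           log_prob_out: Callable[[str], float]) -> List[float]:
    """Cross-entropy-difference score per example (higher = more in-domain)."""
    return [log_prob_in(t) - log_prob_out(t) for t in texts]

def importance_weights(texts: Iterable[str],
                       log_prob_in: Callable[[str], float],
                       log_prob_out: Callable[[str], float]) -> List[float]:
    """Importance sampling: weight each example by p_in(x) / p_out(x)."""
    return [math.exp(s) for s in scores(texts, log_prob_in, log_prob_out)]

def intelligent_selection(texts: List[str],
                          log_prob_in: Callable[[str], float],
                          log_prob_out: Callable[[str], float],
                          keep_fraction: float = 0.1) -> List[str]:
    """Moore-Lewis style selection: keep the highest-scoring fraction."""
    ranked = sorted(texts,
                    key=lambda t: log_prob_in(t) - log_prob_out(t),
                    reverse=True)
    return ranked[: max(1, int(keep_fraction * len(ranked)))]
```

In this view, influence functions swap the density-ratio score for an estimate of each example's effect on the in-domain loss, while the downstream reweight-or-select machinery stays the same, which is the kind of similarity-with-subtle-differences the common framework exposes.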