In the problem of domain generalization (DG), there are labeled training data sets from several related prediction problems, and the goal is to make accurate predictions on future unlabeled data sets that are not known at training time. This problem arises in several applications where data distributions fluctuate because of environmental, technical, or other sources of variation. We introduce a formal framework for DG and argue that it can be viewed as a kind of supervised learning problem by augmenting the original feature space with the marginal distribution of feature vectors. While our framework has several connections to conventional analyses of supervised learning algorithms, several unique aspects of DG require new methods of analysis. This work lays the learning-theoretic foundations of domain generalization, building on our earlier conference paper in which the problem of DG was introduced (Blanchard et al., 2011). We present two formal models of data generation, corresponding notions of risk, and distribution-free generalization error analyses. Focusing on kernel methods, we also provide more quantitative results and a universally consistent algorithm. We provide an efficient implementation of this algorithm and experimentally compare it to a pooling strategy on one synthetic and three real-world data sets.
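To make the augmented-feature-space idea concrete, here is a minimal numerical sketch, not the authors' implementation: each training point x drawn from a task with feature marginal P_X is treated as the pair (P_X, x), and a product kernel combines a kernel between distributions, computed here from empirical kernel mean embeddings with a simple linear kernel (the paper composes a further nonlinear kernel on the embeddings), with a standard Gaussian kernel on the features. The function names (`rbf_gram`, `augmented_gram`) and the bandwidth `gamma_x` are illustrative assumptions.

```python
import numpy as np

def rbf_gram(A, B, gamma):
    """Gaussian RBF Gram matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * sq)

def augmented_gram(tasks, gamma_x=1.0):
    """
    Gram matrix of a product kernel on the augmented space (P_X, x):
    K((P_i, x), (P_j, x')) = <mu_i, mu_j> * k_X(x, x'),
    where mu_i is the empirical kernel mean embedding of task i, so
    <mu_i, mu_j> is the average of k_X over all cross-task sample pairs.
    `tasks` is a list of (n_i, d) arrays, one per training task.
    """
    # Feature-kernel Gram blocks between every pair of tasks.
    blocks = [[rbf_gram(Xi, Xj, gamma_x) for Xj in tasks] for Xi in tasks]
    # Inner products of empirical mean embeddings (a linear kernel
    # between distributions; chosen here for simplicity).
    m = len(tasks)
    kp = np.array([[blocks[i][j].mean() for j in range(m)] for i in range(m)])
    # Scale each feature block by the corresponding distribution-kernel value.
    rows = [np.hstack([kp[i, j] * blocks[i][j] for j in range(m)])
            for i in range(m)]
    return np.vstack(rows)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Three toy tasks whose feature marginals differ by a mean shift.
    tasks = [rng.normal(loc=m, size=(20, 5)) for m in (0.0, 0.5, 1.0)]
    K = augmented_gram(tasks)
    print(K.shape)               # (60, 60)
    print(np.allclose(K, K.T))   # symmetric, as a Gram matrix must be
```

Because a product of positive semi-definite kernels is positive semi-definite, the resulting Gram matrix can be used in any standard kernel machine; at test time, the empirical mean embedding of the new, unlabeled data set supplies the distributional part of the kernel.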