Domain adaptation is an important problem in statistical learning theory that arises when the data-generating processes of the training and test samples, called the source and target domains respectively, differ. Recent theoretical advances show that the success of domain adaptation algorithms relies heavily on their ability to minimize the divergence between the probability distributions of the source and target domains. However, this divergence cannot be minimized independently of the other key terms, such as the source risk or the combined error of the ideal joint hypothesis. The trade-off between these terms is often ensured by algorithmic solutions that remain implicit and are not directly reflected in the theoretical guarantees. To address this issue, we propose a new theoretical framework for domain adaptation through hierarchical optimal transport. This framework provides more explicit generalization bounds and allows us to take into account the natural hierarchical organization of samples in both domains into classes or clusters. Additionally, we provide a new divergence measure between the source and target domains, called the Hierarchical Wasserstein distance, which indicates, under mild assumptions, which structures have to be aligned for adaptation to succeed.
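To make the idea of a hierarchical transport problem concrete, the following is a minimal illustrative sketch and not the construction used in the paper. It assumes one-dimensional samples, equal-size groups, the same number of groups in both domains, and uniform group weights. Under these assumptions an optimal outer plan is a one-to-one matching of groups, so it can be found by brute force. The function names `wasserstein_1d` and `hierarchical_wasserstein` are hypothetical and chosen only for illustration.

```python
from itertools import permutations


def wasserstein_1d(xs, ys):
    """W1 distance between two equal-size 1-D empirical samples (uniform weights).

    For equal-size samples on the real line, the optimal coupling pairs the
    sorted values, so W1 is the mean absolute difference of sorted samples.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal-size samples")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def hierarchical_wasserstein(source_groups, target_groups):
    """Outer transport over groups, using the inner W1 as ground cost.

    Assumption (not from the paper): both sides have the same number of
    groups with uniform weights, so the outer problem reduces to finding
    the cheapest one-to-one matching between groups.
    """
    k = len(source_groups)
    if k != len(target_groups):
        raise ValueError("this sketch assumes the same number of groups")

    # Inner level: pairwise W1 distances between source and target groups.
    cost = [[wasserstein_1d(s, t) for t in target_groups] for s in source_groups]

    # Outer level: brute-force search over matchings (fine for small k only).
    best = min(
        permutations(range(k)),
        key=lambda p: sum(cost[i][p[i]] for i in range(k)),
    )
    value = sum(cost[i][best[i]] for i in range(k)) / k
    return value, best


# Toy data: two source classes and two target clusters listed in swapped order.
source_classes = [[0.0, 1.0, 2.0], [10.0, 11.0, 12.0]]
target_clusters = [[10.5, 11.5, 12.5], [0.5, 1.5, 2.5]]

distance, matching = hierarchical_wasserstein(source_classes, target_clusters)
print(distance, matching)  # 0.5 (1, 0)
```

In this toy example the matching `(1, 0)` pairs each source class with the target cluster that sits closest to it, which is the kind of structure-level alignment the abstract refers to. For realistic data a general optimal transport solver would replace the brute-force search.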