Domain shifts in the training data are common in practical applications of machine learning; they occur, for instance, when the data comes from different sources. Ideally, an ML model should perform well regardless of these shifts, for example by learning a domain-invariant representation. Moreover, privacy concerns regarding the data source also require a domain-invariant representation. In this work, we provide theoretical results that link domain-invariant representations, measured by the Wasserstein distance on the joint distributions, to a practical semi-supervised learning objective based on a cross-entropy classifier and a novel domain critic. Quantitative experiments demonstrate that the proposed approach is able to learn such an invariant representation between two domains in practice, and that this representation supports models with higher predictive accuracy on both domains, comparing favorably to existing techniques.
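To make the notion of measuring invariance with a Wasserstein distance concrete, the following is a minimal sketch, not the method of this work. It assumes one-dimensional features and equal sample sizes, and it compares only marginal feature distributions rather than the joint distributions used in the theoretical results. The names `wasserstein_1d` and `center`, the Gaussian toy data, and the mean-centering "representation" are all hypothetical and chosen for illustration; no learned domain critic is involved.

```python
import random


def wasserstein_1d(xs, ys):
    """Empirical 1-Wasserstein distance between two equal-size 1-D samples.

    In one dimension with equal sample sizes, an optimal transport plan
    pairs the i-th smallest value of xs with the i-th smallest value of ys.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    if not xs:
        raise ValueError("samples must be non-empty")
    paired = zip(sorted(xs), sorted(ys))
    return sum(abs(a - b) for a, b in paired) / len(xs)


def center(values):
    """Toy stand-in for a representation: subtract the sample mean.

    Assumption for illustration only; a learned encoder would replace this.
    """
    mean = sum(values) / len(values)
    return [v - mean for v in values]


rng = random.Random(0)  # fixed seed so the toy example is repeatable

# Hypothetical raw features: domain B is domain A's distribution shifted by 2.
domain_a = [rng.gauss(0.0, 1.0) for _ in range(1000)]
domain_b = [rng.gauss(2.0, 1.0) for _ in range(1000)]

# Distance between the domains before and after the toy representation.
raw_gap = wasserstein_1d(domain_a, domain_b)
aligned_gap = wasserstein_1d(center(domain_a), center(domain_b))

print(f"raw gap: {raw_gap:.3f}, aligned gap: {aligned_gap:.3f}")
```

In this toy setting the distance between the two domains shrinks once the domain-specific mean is removed, which is the behaviour a domain-invariant representation is meant to exhibit. In higher dimensions the distance has no such closed form, which is one reason a critic-based estimate is typically used instead.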