Suppose we are given two datasets: a labeled dataset and an unlabeled dataset that also has additional auxiliary features not present in the first dataset. What is the most principled way to use these datasets together to construct a predictor? The answer should depend on whether these datasets are generated by the same or different distributions over their mutual feature sets, and on how similar the test distribution will be to either of them. In many applications, the two datasets will likely follow different distributions, but both may be close to the test distribution. We introduce the problem of building a predictor that minimizes the maximum loss over all probability distributions over the original features, auxiliary features, and binary labels whose Wasserstein distance is at most $r_1$ from the empirical distribution over the labeled dataset and at most $r_2$ from that of the unlabeled dataset. This can be thought of as a generalization of distributionally robust optimization (DRO) that allows for two data sources, one of which is unlabeled and may contain auxiliary features.
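One way to make the objective concrete is the following minimax sketch, where the notation is ours rather than the paper's: $\hat{P}_1$ denotes the empirical distribution of the labeled dataset over (features, labels), $\hat{P}_2$ the empirical distribution of the unlabeled dataset over (features, auxiliary features), $W$ the Wasserstein distance, and $\ell$ a loss applied to a predictor $f$ on a point $(x, a, y)$ of original features, auxiliary features, and binary label:

```latex
\min_{f} \;
\max_{\substack{Q \,:\; W(Q_{x,y},\, \hat{P}_1) \le r_1 \\ \phantom{Q\,:\;} W(Q_{x,a},\, \hat{P}_2) \le r_2}}
\; \mathbb{E}_{(x,a,y) \sim Q}\!\left[\, \ell\big(f(x,a),\, y\big) \right]
```

Here the two Wasserstein constraints are naturally read as constraints on marginals of $Q$: the distance to $\hat{P}_1$ is measured on the marginal over original features and labels (the labeled dataset carries no auxiliary features), and the distance to $\hat{P}_2$ on the marginal over original and auxiliary features (the unlabeled dataset carries no labels). Setting $r_2 = \infty$ recovers standard Wasserstein DRO around the labeled dataset alone.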