The establishment of a link between causality and unsupervised domain adaptation (UDA)/semi-supervised learning (SSL) has led to methodological advances in these learning problems in recent years. However, a formal theory that explains the role of causality in the generalization performance of UDA/SSL is still lacking. In this paper, we consider the UDA/SSL setting in which we have access to m labeled source samples and n unlabeled target samples as training instances under a parametric probabilistic model. We study the learning performance (e.g., the excess risk) of prediction in the target domain. Specifically, we distinguish two scenarios: the learning problem is called causal learning if the feature is the cause and the label is the effect, and anti-causal learning otherwise. We show that in causal learning, the excess risk depends on the size of the source sample at a rate of O(1/m) only if the labeling distribution between the source and target domains remains unchanged. In anti-causal learning, we show that the unlabeled data dominate the performance, typically at a rate of O(1/n). Our analysis is based on the notion of potential outcome random variables and information theory. These results bring out the relationship between the sample sizes and the hardness of the learning problem under different causal mechanisms.