Causal graphs (CGs) compactly represent knowledge of the data-generating process behind a data distribution. When a CG is available, e.g., from domain knowledge, we can infer the conditional independence (CI) relations that should hold in the data distribution. However, it is not straightforward to incorporate this knowledge into predictive modeling. In this work, we propose a model-agnostic data augmentation method that exploits the prior knowledge of CI relations encoded in a CG for supervised machine learning. We theoretically justify the proposed method with an excess risk bound indicating that it suppresses overfitting by reducing the apparent complexity of the predictor hypothesis class. Using real-world data with CGs provided by domain experts, we experimentally show that the proposed method improves prediction accuracy, especially in the small-data regime.
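To make concrete the step of reading CI relations off a CG, the following is a minimal sketch of a d-separation check, using the standard criterion based on the moralized ancestral graph. It is not the proposed augmentation method itself. The graph `toy_parents` and the helper names `ancestors_of` and `d_separated` are hypothetical and chosen only for illustration; the CG is assumed to be a directed acyclic graph given as a mapping from each node to its parents.

```python
from itertools import combinations


def ancestors_of(parents, nodes):
    """Return `nodes` together with all of their ancestors in the DAG."""
    seen, stack = set(), list(nodes)
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def d_separated(parents, xs, ys, zs):
    """Check whether xs and ys are d-separated given zs.

    `parents` maps each node to an iterable of its parents (assumed acyclic).
    Returns True when the graph implies xs is independent of ys given zs.
    """
    xs, ys, zs = set(xs), set(ys), set(zs)

    # Step 1: restrict to the ancestral set of all nodes involved.
    keep = ancestors_of(parents, xs | ys | zs)

    # Step 2: moralize, i.e. link each node to its parents, link parents
    # that share a child, and drop edge directions.
    adj = {v: set() for v in keep}
    for child in keep:
        pa = set(parents.get(child, ()))
        for p in pa:
            adj[child].add(p)
            adj[p].add(child)
        for a, b in combinations(pa, 2):
            adj[a].add(b)
            adj[b].add(a)

    # Step 3: delete the conditioning nodes and test reachability.
    seen, stack = set(), [x for x in xs if x not in zs]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        if node in ys:
            return False  # an unblocked path connects xs and ys
        seen.add(node)
        stack.extend(adj[node] - zs)
    return True


# Hypothetical toy CG: A -> B -> C (a chain) and A -> D <- E (a collider).
toy_parents = {"B": ["A"], "C": ["B"], "D": ["A", "E"]}

print(d_separated(toy_parents, {"A"}, {"C"}, {"B"}))  # True: B blocks the chain
print(d_separated(toy_parents, {"A"}, {"E"}, {"D"}))  # False: conditioning on collider D
```

Each statement for which `d_separated` returns `True` is a CI relation that the data distribution should satisfy if the CG is correct, which is the kind of prior knowledge the abstract refers to.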