This paper presents a new optimization approach to causal estimation. Given data that contains covariates and an outcome, which covariates cause the outcome, and how strong is each causal effect? In classical machine learning (ML), the goal of optimization is to maximize predictive accuracy. However, some covariates might exhibit a non-causal association with the outcome. Such spurious associations provide predictive power for classical ML, but they prevent us from interpreting the result causally. This paper proposes CoCo, an optimization algorithm that bridges the gap between pure prediction and causal inference. CoCo leverages the recently proposed idea of environments: datasets of covariates and responses in which the causal relationships remain invariant but the distribution of the covariates changes from environment to environment. Given datasets from multiple environments that exhibit sufficient heterogeneity, CoCo maximizes an objective for which the only solution is the causal solution. We describe the theoretical foundations of this approach and demonstrate its effectiveness on simulated and real datasets. Compared to classical ML and existing methods, CoCo provides more accurate estimates of the causal model.
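To make the setting concrete, the following is a minimal sketch of the problem the abstract describes, not of the CoCo objective itself. It assumes a hypothetical data-generating process (x1 causes y, and x2 is an effect of y whose noise level differs by environment) and uses ordinary least squares as a stand-in for a classical predictive fit. The function names, coefficients, and noise levels are all illustrative assumptions.

```python
import numpy as np


def simulate_environment(n, spurious_noise_sd, rng):
    """Hypothetical process x1 -> y -> x2; only the noise on x2 varies by environment."""
    x1 = rng.normal(0.0, 1.0, size=n)
    y = 2.0 * x1 + rng.normal(0.0, 1.0, size=n)  # assumed true causal coefficient: 2.0
    x2 = y + rng.normal(0.0, spurious_noise_sd, size=n)  # an effect of y, not a cause
    return np.column_stack([x1, x2]), y


def ols(X, y):
    """Least-squares coefficients (no intercept; all variables are zero-mean here)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


rng = np.random.default_rng(0)
envs = [simulate_environment(20000, sd, rng) for sd in (0.5, 2.0)]

# Classical fit: pool all environments and optimize predictive accuracy only.
X_pool = np.vstack([X for X, _ in envs])
y_pool = np.concatenate([y for _, y in envs])
pooled_coef = ols(X_pool, y_pool)

# Per-environment fits using both covariates versus the causal covariate alone.
per_env_full = [ols(X, y) for X, y in envs]
per_env_causal = [ols(X[:, :1], y) for X, y in envs]

print("pooled (x1, x2):", pooled_coef)
for i, (full, causal) in enumerate(zip(per_env_full, per_env_causal)):
    print(f"env {i}: full fit = {full}, x1-only fit = {causal}")
```

In this toy setup the pooled fit puts clearly nonzero weight on x2 because it helps prediction, and the full per-environment coefficients shift as the noise on x2 changes. The x1-only fit stays close to the assumed causal coefficient in both environments, which is the kind of invariance that environment-based methods exploit.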