Causal discovery for observational sciences using supervised machine learning

Causal inference can estimate causal effects, but unless data are collected experimentally, statistical analyses must rely on pre-specified causal models. Causal discovery algorithms are empirical methods for constructing such causal models from data. Several asymptotically correct methods already exist, but they generally struggle on smaller samples. Moreover, most methods focus on very sparse causal models, which may not always be a realistic representation of real-life data generating mechanisms. Finally, while causal relationships suggested by the methods often hold true, their claims about causal non-relatedness have high error rates. This non-conservative error tradeoff is not ideal for observational sciences, where the resulting model is directly used to inform causal inference: a causal model with many missing causal relations entails overly strong assumptions and may lead to biased effect estimates. We propose a new causal discovery method that addresses these three shortcomings: supervised learning discovery (SLdisco). SLdisco uses supervised machine learning to obtain a mapping from observational data to equivalence classes of causal models. We evaluate SLdisco in a large simulation study based on Gaussian data, and we consider several choices of model size and sample size. We find that SLdisco is more conservative, only moderately less informative, and less sensitive to sample size than existing procedures. We furthermore provide a real epidemiological data application. We use random subsampling to investigate real data performance on small samples and again find that SLdisco is less sensitive to sample size and hence seems to better utilize the information available in small datasets.
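
To make the supervised setup concrete, the following is a minimal sketch of how one labelled training example could be simulated from Gaussian data. It is not the authors' implementation. The abstract does not specify the input features, the label encoding, or the learner, so everything here is an assumption for illustration: the function names (`random_dag`, `simulate_gaussian`, `training_pair`), the edge-weight range, the use of the sample correlation matrix as input, and the use of the undirected skeleton as a simplified stand-in for the equivalence class.

```python
import numpy as np


def random_dag(p, edge_prob, rng):
    """Random weighted DAG on p nodes (hypothetical helper).

    W[i, j] != 0 means i -> j. The matrix is strictly upper triangular,
    so the graph is acyclic by construction.
    """
    mask = np.triu(rng.random((p, p)) < edge_prob, k=1)
    # Assumed weight range: magnitude in [0.5, 1.5] with a random sign.
    weights = rng.uniform(0.5, 1.5, size=(p, p)) * rng.choice([-1.0, 1.0], size=(p, p))
    return mask * weights


def simulate_gaussian(W, n, rng):
    """Draw n samples from the linear Gaussian model defined by W."""
    p = W.shape[0]
    X = np.zeros((n, p))
    for j in range(p):  # columns are already in topological order
        X[:, j] = X @ W[:, j] + rng.standard_normal(n)
    return X


def training_pair(p, n, edge_prob, rng):
    """Return one (input, label) pair for a supervised learner.

    Input: sample correlation matrix (assumed feature choice).
    Label: symmetric 0/1 skeleton matrix (simplified stand-in for the
    equivalence class; edge orientations are ignored in this sketch).
    """
    W = random_dag(p, edge_prob, rng)
    X = simulate_gaussian(W, n, rng)
    skeleton = ((W != 0) | (W.T != 0)).astype(int)

    # Shuffle variable order so the learner cannot exploit the
    # topological ordering used during simulation.
    perm = rng.permutation(p)
    X = X[:, perm]
    skeleton = skeleton[np.ix_(perm, perm)]

    corr = np.corrcoef(X, rowvar=False)
    return corr, skeleton


rng = np.random.default_rng(0)
corr, skeleton = training_pair(p=5, n=200, edge_prob=0.4, rng=rng)
```

Repeating `training_pair` many times, over several choices of `p` and `n`, would give a dataset on which any supervised model could be trained to map data summaries to graph structure. Varying `edge_prob` is one way to cover dense as well as sparse graphs.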
