Additive Noise Models (ANM) encode a popular functional assumption that enables learning causal structure from observational data. Due to a lack of real-world data meeting the assumptions, synthetic ANM data are often used to evaluate causal discovery algorithms. Reisach et al. (2021) show that, for common simulation parameters, a variable ordering by increasing variance is closely aligned with a causal order and introduce var-sortability to quantify the alignment. Here, we show that not only variance, but also the fraction of a variable's variance explained by all others, as captured by the coefficient of determination $R^2$, tends to increase along the causal order. Simple baseline algorithms can use $R^2$-sortability to match the performance of established methods. Since $R^2$-sortability is invariant under data rescaling, these algorithms perform equally well on standardized or rescaled data, addressing a key limitation of algorithms exploiting var-sortability. We characterize and empirically assess $R^2$-sortability for different simulation parameters. We show that all simulation parameters can affect $R^2$-sortability and must be chosen deliberately to control the difficulty of the causal discovery task and the real-world plausibility of the simulated data. We provide an implementation of the sortability measures and sortability-based algorithms in our library CausalDisco (https://github.com/CausalDisco/CausalDisco).
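To make the two central claims concrete, the following is a minimal sketch of how the $R^2$ of each variable given all others can be computed and compared with a causal order on simulated linear ANM data. It is an illustration under stated assumptions, not the CausalDisco API: the function names (`simulate_linear_anm`, `r2_per_variable`, `pairwise_sortability`) and the simulation settings (Gaussian noise, edge probability 0.5, weight magnitudes in $[0.5, 2]$) are hypothetical choices made here. The score is a simplified pairwise variant that counts ancestor-descendant pairs and may differ from the exact sortability definition used in the paper.

```python
import numpy as np


def simulate_linear_anm(weights, n_samples, rng):
    """Sample from a linear ANM X_j = sum_i weights[i, j] * X_i + N_j.

    Assumes `weights` is strictly upper triangular, so the index order
    0, 1, ..., d-1 is a causal order. Noise is standard Gaussian (assumption).
    """
    d = weights.shape[0]
    noise = rng.standard_normal((n_samples, d))
    # Row-wise: x = x W + n  =>  x = n (I - W)^{-1}
    return noise @ np.linalg.inv(np.eye(d) - weights)


def r2_per_variable(data):
    """R^2 of the linear regression of each variable on all other variables.

    Uses the identity R^2_i = 1 - 1 / (C^{-1})_{ii}, where C is the sample
    correlation matrix. Only correlations enter, so rescaling has no effect.
    """
    corr = np.corrcoef(data, rowvar=False)
    return 1.0 - 1.0 / np.diag(np.linalg.inv(corr))


def pairwise_sortability(scores, weights, tol=1e-9):
    """Fraction of ancestor-descendant pairs whose score increases from
    ancestor to descendant; ties count 1/2. Simplified illustrative variant."""
    adj = (weights != 0).astype(int)
    d = adj.shape[0]
    reach = np.zeros((d, d), dtype=bool)
    power = np.eye(d, dtype=int)
    for _ in range(d - 1):
        # power[i, j] == 1 iff a directed path of the current length leads i -> j
        power = (power @ adj > 0).astype(int)
        reach |= power.astype(bool)
    causes, effects = np.nonzero(reach)
    if len(causes) == 0:
        return float("nan")  # graph without edges: score undefined
    diff = scores[effects] - scores[causes]
    agree = np.sum(diff > tol) + 0.5 * np.sum(np.abs(diff) <= tol)
    return float(agree / len(causes))


rng = np.random.default_rng(0)
d = 8
# Random DAG with random signed edge weights (hypothetical simulation settings)
mask = np.triu(rng.random((d, d)) < 0.5, k=1)
weights = mask * rng.uniform(0.5, 2.0, size=(d, d)) * rng.choice([-1.0, 1.0], size=(d, d))
data = simulate_linear_anm(weights, 5000, rng)

r2 = r2_per_variable(data)
var_score = pairwise_sortability(data.var(axis=0), weights)
r2_score = pairwise_sortability(r2, weights)
print("variance-based score:", var_score)
print("R^2-based score:     ", r2_score)

# Rescale every variable by an arbitrary positive factor: variances change,
# the R^2 values (and hence any ordering derived from them) do not.
scales = rng.uniform(0.1, 10.0, size=d)
r2_rescaled = r2_per_variable(data * scales)
print("R^2 unchanged after rescaling:", np.allclose(r2, r2_rescaled, atol=1e-6))
```

A score near 1 means the criterion tends to increase along the causal order, a score near 0.5 means it carries little ordering information, and a score near 0 means it tends to decrease. Sorting the variables by increasing `r2` gives the candidate causal order that a simple sortability-based baseline could start from.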