Many scientific datasets are compositional in nature. Important examples include species abundances in ecology, rock compositions in geology, topic compositions in large-scale text corpora, and sequencing count data in molecular biology. Here, we provide a causal view on compositional data in an instrumental variable setting where the composition acts as the cause. First, we crisply articulate potential pitfalls for practitioners regarding the interpretation of compositional causes from the viewpoint of interventions and warn against attributing causal meaning to common summary statistics such as diversity indices. We then advocate for and develop multivariate methods using statistical data transformations and regression techniques that take the special structure of the compositional sample space into account. In a comparative analysis on synthetic and real data, we show the advantages and limitations of our proposal. We posit that our framework provides a useful starting point and guidance for valid and informative cause-effect estimation in the context of compositional data.
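As a rough illustration of what combining a data transformation with instrumental variable regression can look like, the sketch below maps compositions to isometric log-ratio (ilr) coordinates and then runs two-stage least squares on those coordinates. The abstract does not name a specific transformation or estimator, so the ilr choice, the Helmert-type basis, the synthetic data-generating process, and all function and variable names here are assumptions made for illustration only, not the authors' method.

```python
import numpy as np


def closure(x):
    """Rescale non-negative parts so that each row sums to one."""
    x = np.asarray(x, dtype=float)
    return x / x.sum(axis=-1, keepdims=True)


def clr(x):
    """Centered log-ratio transform; invariant to the total of each row."""
    logx = np.log(closure(x))
    return logx - logx.mean(axis=-1, keepdims=True)


def helmert_basis(p):
    """Orthonormal (p-1, p) basis whose rows sum to zero (one ilr basis choice)."""
    V = np.zeros((p - 1, p))
    for i in range(1, p):
        V[i - 1, :i] = 1.0 / i
        V[i - 1, i] = -1.0
        V[i - 1] *= np.sqrt(i / (i + 1.0))
    return V


def ilr(x, V):
    """Isometric log-ratio coordinates of compositions x with respect to V."""
    return clr(x) @ V.T


def ilr_inv(z, V):
    """Map ilr coordinates back to compositions on the simplex."""
    return closure(np.exp(z @ V))


def two_stage_least_squares(Z, X, y):
    """2SLS with intercepts; returns the coefficients on X (intercept dropped)."""
    n = len(y)
    Z1 = np.column_stack([np.ones(n), Z])
    # Stage 1: project the (transformed) cause onto the instruments.
    X_hat = Z1 @ np.linalg.lstsq(Z1, X, rcond=None)[0]
    # Stage 2: regress the outcome on the projected cause.
    X1 = np.column_stack([np.ones(n), X_hat])
    return np.linalg.lstsq(X1, y, rcond=None)[0][1:]


# Synthetic example (assumed data-generating process, for illustration only).
rng = np.random.default_rng(0)
n, p = 5000, 3
V = helmert_basis(p)
Z = rng.normal(size=(n, 2))                  # instruments
U = rng.normal(size=n)                       # unobserved confounder
A = np.array([[1.0, 0.5], [-0.5, 1.0]])      # instrument strength
coords = Z @ A + np.outer(U, [1.0, -1.0]) + 0.5 * rng.normal(size=(n, 2))
composition = ilr_inv(coords, V)             # observed compositional cause
beta = np.array([1.5, -1.0])                 # true effect in ilr coordinates
y = coords @ beta + 2.0 * U + rng.normal(size=n)

beta_hat = two_stage_least_squares(Z, ilr(composition, V), y)
print("true beta:", beta, "2SLS estimate:", beta_hat)
```

The estimated coefficients refer to the ilr coordinates, so their interpretation depends on the chosen basis. In this sketch the number of instruments equals the number of ilr coordinates, which is the minimum needed for the second stage to be identified.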