Many scientific datasets are compositional in nature. Important examples include species abundances in ecology, rock compositions in geology, topic compositions in large-scale text corpora, and sequencing count data in molecular biology. Here, we provide a causal view on compositional data in an instrumental variable setting where the composition acts as the cause. Throughout, we pay particular attention to the interpretation of compositional causes from the viewpoint of interventions and crisply articulate potential pitfalls for practitioners. Focusing on modern high-dimensional microbiome sequencing data as a timely illustrative use case, our analysis first reveals that popular one-dimensional information-theoretic summary statistics, such as diversity and richness, may be insufficient for drawing causal conclusions from ecological data. Instead, we advocate for multivariate alternatives using statistical data transformations and regression techniques that take the special structure of the compositional sample space into account. In a comparative analysis on synthetic and semi-synthetic data we show the advantages and limitations of our proposal. We posit that our framework may provide a useful starting point for cause-effect estimation in the context of compositional data.
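The claim that one-dimensional summaries may be insufficient can be illustrated with a minimal sketch. The toy samples and the helper names `closure`, `shannon_diversity`, `richness`, and `clr` below are hypothetical and not taken from the paper. The centered log-ratio (CLR) transform stands in for a multivariate, composition-aware representation; the abstract does not name a specific transformation, so this choice is an assumption.

```python
import math


def closure(x):
    """Rescale non-negative parts so they sum to one (a composition)."""
    total = sum(x)
    return [v / total for v in x]


def shannon_diversity(x):
    """Shannon entropy (natural log) of the composition; zero parts are skipped."""
    return -sum(p * math.log(p) for p in closure(x) if p > 0)


def richness(x):
    """Number of parts with non-zero abundance."""
    return sum(1 for v in x if v > 0)


def clr(x):
    """Centered log-ratio transform; assumes all parts are strictly positive."""
    logs = [math.log(p) for p in closure(x)]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


# Two toy samples over the same three taxa (made-up counts).
# sample_b is a permutation of sample_a: a different taxon dominates.
sample_a = [70, 20, 10]
sample_b = [10, 20, 70]

print("diversity:", shannon_diversity(sample_a), shannon_diversity(sample_b))
print("richness: ", richness(sample_a), richness(sample_b))
print("clr(a):   ", clr(sample_a))
print("clr(b):   ", clr(sample_b))
```

The two samples have the same diversity and richness because both statistics ignore which taxon carries which share, whereas their CLR coordinates differ. If an outcome depended on a particular taxon, a scalar summary could not distinguish these two compositions, while a multivariate representation could.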