Heterogeneity assessment in causal data fusion problems

Previous work has formalized the conditions under which findings from a source population can reasonably be extrapolated to a target population, the so-called "transportability" problem. While most of this work focuses on a setting with two populations, several recent papers have also established the identifiability of a causal parameter when multiple data sources are available, under certain homogeneity assumptions. However, we know of little work examining transportability when data sources are possibly heterogeneous, e.g., in the distribution of mediators of the exposure-outcome relation. The presence of such heterogeneity generally invalidates the transportability assumption required in most of the literature. In this paper, we propose a general approach for heterogeneity assessment when estimating the average exposure effect in a target population, with mediator and outcome data obtained from multiple external sources. To account for heterogeneity, we define different effect estimands according to which source the mediator information and the outcome information are transported from. We discuss the causal assumptions needed to identify these estimands, then propose efficient semi-parametric estimation strategies that allow the use of flexible data-adaptive machine learning methods to estimate the nuisance parameters. We also propose two new methods to investigate sources of heterogeneity in the transported estimates. These methods inform users about how much of the observed statistical heterogeneity in the transported effects is due to differences across data sources in (1) the conditional distribution of the mediator variables and/or (2) the conditional distribution of the outcome. We illustrate the proposed methods using four sites that were part of the Moving to Opportunity Study, an experiment that randomized housing voucher receipt to participating families living in public housing.
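To make the idea of source-specific estimands concrete, the following is a minimal sketch and not the estimator proposed in the paper. It assumes a g-computation (mediation-formula) form in which the mediator distribution is taken from source j and the outcome regression from source k, both averaged over the target covariate distribution. All names (`p_w`, `p_m1`, `mu`, `transported_mean`, `effect`) and all numbers are hypothetical, and the conditional distributions are treated as known rather than estimated.

```python
import numpy as np

# Hypothetical toy setup: binary covariate W, exposure A, mediator M.
# Target-population covariate distribution P(W = w).
p_w = np.array([0.6, 0.4])

# Assumed mediator models: p_m1[j, a, w] = P(M = 1 | A = a, W = w, source j).
p_m1 = np.array([
    [[0.2, 0.2], [0.6, 0.6]],  # source 0
    [[0.2, 0.2], [0.4, 0.4]],  # source 1
])

# Assumed outcome regressions: mu[k, a, m, w] = E[Y | a, m, w, source k].
a_grid = np.arange(2).reshape(2, 1, 1)
m_grid = np.arange(2).reshape(1, 2, 1)
w_grid = np.arange(2).reshape(1, 1, 2)
mu = np.stack([
    1.0 * a_grid + 2.0 * m_grid + 0.5 * w_grid,  # source 0
    1.0 * a_grid + 1.0 * m_grid + 0.5 * w_grid,  # source 1
])


def transported_mean(a: int, j: int, k: int) -> float:
    """Mean outcome under exposure a in the target population, using the
    mediator distribution of source j and the outcome regression of source k."""
    pm1 = p_m1[j, a]                     # P(M = 1 | a, w), shape (w,)
    pm = np.stack([1.0 - pm1, pm1])      # P(M = m | a, w), shape (m, w)
    inner = (mu[k, a] * pm).sum(axis=0)  # sum over m, shape (w,)
    return float(inner @ p_w)            # average over target W


def effect(j: int, k: int) -> float:
    """Transported average exposure effect for the source pair (j, k)."""
    return transported_mean(1, j, k) - transported_mean(0, j, k)


if __name__ == "__main__":
    for j in range(2):
        for k in range(2):
            print(f"mediator source {j}, outcome source {k}: {effect(j, k):.3f}")
```

Holding k fixed and varying j isolates the part of the variation in the transported effect that comes from differences in the mediator distribution. Holding j fixed and varying k isolates the part that comes from the outcome distribution. This is only an informal illustration of the two heterogeneity sources named above.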
