Causal inference methods that control for text-based confounders are becoming increasingly important in the social sciences and other disciplines where text is readily available. These methods rely on a critical assumption that there is no treatment leakage: that is, the text contains information only about the confounder and none about treatment assignment. When this assumption does not hold, methods that control for text to adjust for confounders face the problem of post-treatment (collider) bias. Yet the no-leakage assumption may be unrealistic in real-world situations involving text, because human language is rich and flexible. Language appearing in a public policy document or a health record may refer to the future and the past simultaneously, and thereby reveal information about the treatment assignment. In this article, we first define the treatment-leakage problem and discuss the identification and estimation challenges it raises. Second, we delineate the conditions under which leakage can be addressed by removing the treatment-related signal from the text in a pre-processing step we define as text distillation. Lastly, using simulation, we show how treatment leakage introduces bias in estimates of the average treatment effect (ATE) and how text distillation can mitigate this bias.
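To make the leakage mechanism concrete, the following is a minimal sketch of one way such a simulation could look; it is not the simulation design used in the article. Every name and parameter here is hypothetical: the text is reduced to two numeric features, `w_conf` (carrying the confounder `u`) and `w_leak` (a post-treatment feature driven by the treatment `t` and an unobserved outcome cause `v`, which makes it a collider), the outcome model is linear, and text distillation is modelled simply as dropping `w_leak` before adjustment.

```python
import numpy as np


def simulate(n=20000, tau=1.0, seed=0):
    """Draw a toy data set with a confounder and a leaking text feature.

    All functional forms and coefficients are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    u = rng.normal(size=n)  # confounder, assumed recoverable from the text
    v = rng.normal(size=n)  # unobserved cause of the outcome only
    # Treatment assignment depends on the confounder.
    t = (u + rng.normal(size=n) > 0).astype(float)
    # Linear outcome with true average treatment effect tau.
    y = tau * t + 2.0 * u + 1.5 * v + rng.normal(size=n)
    w_conf = u  # text feature carrying the confounder
    # Leaking text feature: reflects the treatment and v (a collider).
    w_leak = t + v + 0.5 * rng.normal(size=n)
    return t, y, w_conf, w_leak


def ols_treatment_coef(y, t, *covariates):
    """Return the OLS coefficient on t when regressing y on t and covariates."""
    X = np.column_stack([np.ones_like(t), t, *covariates])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]


tau = 1.0
t, y, w_conf, w_leak = simulate(tau=tau)

# No adjustment: confounded by u.
naive = ols_treatment_coef(y, t)
# Adjusting for the full "text", including the leaking feature.
leaky = ols_treatment_coef(y, t, w_conf, w_leak)
# Adjusting after "distillation" (here: dropping the leaking feature).
distilled = ols_treatment_coef(y, t, w_conf)

print(f"true ATE:            {tau:.2f}")
print(f"no adjustment:       {naive:.2f}")
print(f"text with leakage:   {leaky:.2f}")
print(f"distilled text:      {distilled:.2f}")
```

Under these assumptions, conditioning on `w_leak` opens a path between `t` and `v`, so the adjusted estimate moves away from the true ATE, while adjusting for `w_conf` alone recovers it up to sampling noise. In real text the confounder-related and treatment-related signals are not separated into clean columns, which is why the conditions under which distillation is possible matter.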