Causal Estimation for Text Data with (Apparent) Overlap Violations

Consider the problem of estimating the causal effect of some attribute of a text document; for example: what effect does writing a polite vs. rude email have on response time? To estimate a causal effect from observational data, we need to adjust for confounding aspects of the text that affect both the treatment and the outcome, e.g., the topic or writing level of the text. These confounding aspects are unknown a priori, so it seems natural to adjust for the entirety of the text (e.g., using a transformer). However, causal identification and estimation procedures rely on the assumption of overlap: for every level of the adjustment variables, enough randomness remains that each unit could have either received or not received treatment. Since the treatment here is itself an attribute of the text, it is perfectly determined by the text, and overlap is apparently violated. The purpose of this paper is to show how to handle causal identification and obtain robust causal estimates in the presence of apparent overlap violations. In brief, the idea is to use supervised representation learning to produce a data representation that preserves confounding information while eliminating information that is only predictive of the treatment. This representation then suffices for adjustment and can satisfy overlap. Adapting results on non-parametric estimation, we find that this procedure is robust to conditional outcome misestimation, yielding a low-bias estimator with valid uncertainty quantification under weak conditions. Empirical results show strong improvements in bias and uncertainty quantification relative to the natural baseline.
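As a rough illustration of the idea, the sketch below adjusts for an outcome-predictive representation instead of the full feature vector. Everything in it is an assumption made for illustration, not the paper's implementation: a linear least-squares model stands in for the transformer, the representation is taken to be the pair of predicted outcomes under each treatment level, the final step is a textbook augmented inverse-probability-weighted (AIPTW) average, and the data come from a hypothetical simulation in which the treatment is a deterministic function of the features. All function names are made up for this example.

```python
import numpy as np


def add_intercept(features):
    """Prepend a column of ones to a 2-D feature array."""
    return np.column_stack([np.ones(len(features)), features])


def simulate(n, effect, seed):
    """Hypothetical data: treatment is fully determined by the second feature."""
    rng = np.random.default_rng(seed)
    confounder = rng.normal(size=n)
    latent = confounder + rng.logistic(size=n)
    treatment = (latent > 0).astype(float)  # no randomness left given features
    features = np.column_stack([confounder, latent])
    outcome = effect * treatment + 2.0 * confounder + rng.normal(size=n)
    return features, treatment, outcome


def fit_outcome_model(treatment, features, outcome):
    """Least-squares fit of outcome on [1, treatment, features]."""
    design = add_intercept(np.column_stack([treatment, features]))
    coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return coef


def predict_outcome(coef, treatment_value, features):
    """Predicted outcome for every unit with treatment forced to one value."""
    forced = np.full(len(features), float(treatment_value))
    return add_intercept(np.column_stack([forced, features])) @ coef


def fit_propensity(representation, treatment, steps=2000, lr=1.0):
    """Logistic regression by gradient descent on standardized inputs."""
    scaled = (representation - representation.mean(0)) / representation.std(0)
    design = add_intercept(scaled)
    weights = np.zeros(design.shape[1])
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-design @ weights))
        weights -= lr * design.T @ (probs - treatment) / len(treatment)
    return 1.0 / (1.0 + np.exp(-design @ weights))


def aiptw(treatment, outcome, q0, q1, propensity, clip=0.01):
    """AIPTW point estimate and plug-in standard error."""
    g = np.clip(propensity, clip, 1.0 - clip)
    psi = (q1 - q0
           + treatment * (outcome - q1) / g
           - (1.0 - treatment) * (outcome - q0) / (1.0 - g))
    return psi.mean(), psi.std(ddof=1) / np.sqrt(len(psi))


def estimate_effect(features, treatment, outcome):
    """Outcome model -> 2-D representation -> propensity -> AIPTW."""
    coef = fit_outcome_model(treatment, features, outcome)
    q0 = predict_outcome(coef, 0, features)
    q1 = predict_outcome(coef, 1, features)
    representation = np.column_stack([q0, q1])  # keeps outcome-relevant info only
    propensity = fit_propensity(representation, treatment)
    estimate, std_err = aiptw(treatment, outcome, q0, q1, propensity)
    return estimate, std_err, propensity


features, treatment, outcome = simulate(n=20000, effect=2.0, seed=0)
estimate, std_err, propensity = estimate_effect(features, treatment, outcome)
print(f"estimate = {estimate:.3f} +/- {1.96 * std_err:.3f}")
print(f"propensity range = [{propensity.min():.3f}, {propensity.max():.3f}]")
```

In this simulation the treatment can be read off the raw features exactly, so a propensity model fit on them would be pushed toward 0 or 1. The propensity model here sees only the two predicted outcomes, which depend mostly on the confounder, so it should stay away from those extremes. How well this carries over to real text and a learned representation is not something the sketch can show.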
