Leveraging text, such as social media posts, for causal inferences requires the use of NLP models to 'learn' and adjust for confounders, which could otherwise impart bias. However, evaluating such models is challenging, as ground truth is almost never available. We demonstrate the need for empirical evaluation frameworks for causal inference in natural language by showing that existing, commonly used models regularly disagree with one another on real-world tasks. We contribute the first such framework, generalizing several challenges across these real-world tasks. Using this framework, we evaluate a large set of commonly used causal inference models based on propensity scores and identify their strengths and weaknesses to inform future improvements. We make all tasks, data, and models public to inform applications and encourage additional research.
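For readers unfamiliar with the class of models the abstract refers to, the following is a minimal sketch of propensity-score adjustment from text, using inverse-propensity weighting. The toy documents, the TF-IDF encoder, and the logistic-regression propensity model are illustrative assumptions for this sketch, not the specific models evaluated in the paper.

```python
# Minimal sketch: propensity-score adjustment with text confounders.
# Illustrative only; the data and model choices are assumptions.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data: documents (potential confounders), binary treatment, outcome.
docs = ["short post about topic a", "long rant about topic b",
        "neutral status update", "another post about topic a"]
treatment = np.array([1, 0, 1, 0])
outcome = np.array([3.2, 1.1, 2.9, 1.4])

# 1. Encode the text so a model can 'learn' the confounders.
X = TfidfVectorizer().fit_transform(docs)

# 2. Fit a propensity model estimating P(treatment = 1 | text).
propensity = LogisticRegression().fit(X, treatment).predict_proba(X)[:, 1]

# 3. Inverse-propensity-weighted estimate of the average treatment effect.
ate = np.mean(treatment * outcome / propensity
              - (1 - treatment) * outcome / (1 - propensity))
print(f"IPW ATE estimate: {ate:.2f}")
```

Because there is no ground-truth treatment effect for real text data, an estimate like the one above cannot be checked directly, which is the evaluation gap the framework addresses.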