Causal inference using observational text data is becoming increasingly popular in many research areas. This paper presents the Bayesian Topic Regression (BTR) model, which uses both text and numerical information to model an outcome variable. It allows estimation of both discrete and continuous treatment effects and supports the inclusion of additional numerical confounding factors alongside the text data. To this end, we combine a supervised Bayesian topic model with a Bayesian regression framework and perform supervised representation learning for the text features jointly with the training of the regression parameters, respecting the Frisch-Waugh-Lovell theorem. Our paper makes two main contributions. First, we provide a regression framework that allows causal inference in settings where both text and numerical confounders are relevant. We show on synthetic and semi-synthetic datasets that our joint approach recovers ground truth with lower bias than any benchmark model when text and numerical features are correlated. Second, experiments on two real-world datasets demonstrate that a joint and supervised learning strategy also yields superior prediction results compared to strategies that estimate regression weights for text and non-text features separately, and is even competitive with more complex deep neural networks.
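For intuition, and without claiming the paper's exact notation, the outcome model in such a joint topic-regression setting can be sketched as a linear predictor over a document's estimated topic proportions and its numerical covariates; the symbols below (\(\bar{z}_d\), \(x_d\), \(\omega\), \(\gamma\)) are illustrative placeholders rather than the paper's own definitions:

\[
y_d \;=\; \bar{z}_d^{\top}\,\omega \;+\; x_d^{\top}\,\gamma \;+\; \varepsilon_d,
\qquad \varepsilon_d \sim \mathcal{N}\!\left(0, \sigma^{2}\right),
\]

where \(\bar{z}_d\) collects the topic shares of document \(d\) and \(x_d\) the numerical treatment and confounder features. The point of the joint Bayesian treatment is that the text representation \(\bar{z}_d\) is learned together with \((\omega, \gamma)\), rather than fixing the topics in a separate first stage and regressing on them afterwards.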