Feature selection is an important problem in machine learning: the aim is to select the variables that lead to an optimal predictive model. In this paper, we focus on feature selection for predicting post-intervention outcomes from pre-intervention variables. We are motivated by healthcare settings, where the goal is often to select the treatment that will maximize a specific patient's outcome; however, we often do not have enough randomized controlled trial data to identify the conditional treatment effect well. We show how observational data can be used to improve feature selection and effect estimation in two cases: (a) when we know the causal graph and use observational data, and (b) when we do not know the causal graph but have observational and limited experimental data. Our paper extends the notion of Markov boundary to treatment-outcome pairs. We provide theoretical guarantees for the methods we introduce. In simulated data, we show that combining observational and experimental data improves feature selection and effect estimation.