When estimating a regression model, some labels may be missing, or the data may be biased by a selection mechanism. When the response or selection mechanism is ignorable (i.e., independent of the response variable given the features), one can use off-the-shelf regression methods; in the nonignorable case, one typically has to adjust for bias. We observe that privileged data (i.e., data that is only available during training) might render a nonignorable selection mechanism ignorable, and we refer to this scenario as Privilegedly Missing at Random (PMAR). We propose a novel imputation-based regression method, named repeated regression, that is suitable for PMAR. We also consider an importance weighted regression method, and a doubly robust combination of the two. The proposed methods are easy to implement with most popular out-of-the-box regression algorithms. We empirically assess the performance of the proposed methods with extensive simulated experiments and on a synthetically augmented real-world dataset. We conclude that repeated regression can appropriately correct for bias, and can have a considerable advantage over weighted regression, especially when extrapolating to regions of the feature space where the response is never observed.
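The abstract does not spell out the algorithm, so the following is only a minimal sketch of one two-stage, imputation-based reading of repeated regression under PMAR: first regress the response on the features together with the privileged variable using the rows where the response is observed, then regress the imputed responses on the features alone using all rows. The simulated data, the linear model, the helper functions `ols` and `predict`, and every variable name are illustrative assumptions, not the authors' implementation; any off-the-shelf regressor could stand in for `ols`.

```python
import math
import random


def ols(features, targets):
    """Least-squares fit with intercept; returns [b0, b1, ..., bk]."""
    rows = [[1.0] + list(f) for f in features]
    k = len(rows[0])
    # Normal equations (A^T A) b = A^T y, stored as an augmented matrix.
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
           + [sum(r[i] * t for r, t in zip(rows, targets))]
           for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


def predict(coefs, f):
    """Evaluate a fitted linear model on one feature vector."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], f))


# Hypothetical simulation: x is the feature, z is privileged (training only).
rng = random.Random(0)
n = 5000
x = [rng.gauss(0, 1) for _ in range(n)]
z = [rng.gauss(0, 1) for _ in range(n)]
y = [2 * xi + 3 * zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]
# Selection depends on z only: ignorable given (x, z), nonignorable given x.
s = [rng.random() < 1 / (1 + math.exp(-2 * zi)) for zi in z]

y_obs = [yi for yi, si in zip(y, s) if si]

# Naive baseline: regress y on x using only rows with an observed response.
naive = ols([[xi] for xi, si in zip(x, s) if si], y_obs)

# Stage 1: regress y on (x, z) using rows with an observed response.
stage1 = ols([[xi, zi] for xi, zi, si in zip(x, z, s) if si], y_obs)

# Stage 2: impute the response for every row, then regress on x alone.
y_hat = [predict(stage1, [xi, zi]) for xi, zi in zip(x, z)]
repeated = ols([[xi] for xi in x], y_hat)

print("naive    [intercept, slope]:", naive)
print("repeated [intercept, slope]:", repeated)
```

In this simulation the true regression function is E[Y | X = x] = 2x, because z has mean zero and is independent of x. The naive fit picks up a positive intercept because rows with large z are over-represented among the observed responses, while the two-stage fit averages the imputed values over the full sample and so targets the unselected population. The sketch relies on the linear first-stage model being correctly specified.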