Rates of missing data often depend on record-keeping policies and thus may change across times and locations, even when the underlying features are comparatively stable. In this paper, we introduce the problem of Domain Adaptation under Missingness Shift (DAMS). Here, (labeled) source data and (unlabeled) target data would be exchangeable but for different missing data mechanisms. We show that when missing data indicators are available, DAMS can reduce to covariate shift. Focusing on the setting where missing data indicators are absent, we establish the following theoretical results for underreporting completely at random: (i) covariate shift is violated (adaptation is required); (ii) the optimal source predictor can perform worse on the target domain than a constant one; (iii) the optimal target predictor can be identified, even when the missingness rates themselves are not; and (iv) for linear models, a simple analytic adjustment yields consistent estimates of the optimal target parameters. In experiments on synthetic and semi-synthetic data, we demonstrate the promise of our methods when assumptions hold. Finally, we discuss a rich family of future extensions.
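Result (i) can be illustrated with a small simulation. The sketch below is not the analytic adjustment mentioned in (iv), and none of it comes from the paper's code: the function names `simulate`, `fit_ols`, and `mse`, the coefficients, and the missingness rates are all hypothetical. It assumes that an underreported value is recorded as zero with no indicator, that features are independent Gaussians with a nonzero mean, and that the label is linear in the true features. The target-side fit uses target labels, which DAMS does not provide; it serves only as an oracle for comparison.

```python
import numpy as np


def simulate(n, miss_rates, rng, beta=(2.0, -1.0), mean=2.0):
    """Draw (x_tilde, y): y is linear in the true x, x_tilde is zero-filled.

    Assumption: each feature j is independently replaced by 0 with
    probability miss_rates[j], and no missingness indicator is kept.
    """
    beta = np.asarray(beta)
    miss_rates = np.asarray(miss_rates)
    x = rng.normal(loc=mean, scale=1.0, size=(n, beta.size))
    y = x @ beta + rng.normal(scale=0.5, size=n)
    observed = rng.random(x.shape) >= miss_rates  # True with prob. 1 - rate
    x_tilde = np.where(observed, x, 0.0)
    return x_tilde, y


def fit_ols(x, y):
    """Least-squares fit with an intercept; returns [intercept, slopes...]."""
    design = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def mse(coef, x, y):
    """Mean squared error of the linear predictor `coef` on (x, y)."""
    design = np.column_stack([np.ones(len(x)), x])
    return float(np.mean((design @ coef - y) ** 2))


rng = np.random.default_rng(0)

# Hypothetical missingness rates: low in the source, higher in the target.
x_src, y_src = simulate(50_000, [0.1, 0.1], rng)
x_tgt, y_tgt = simulate(50_000, [0.6, 0.3], rng)

coef_source = fit_ols(x_src, y_src)  # what we can learn from labeled source data
coef_target = fit_ols(x_tgt, y_tgt)  # oracle fit, uses target labels

mse_source_on_target = mse(coef_source, x_tgt, y_tgt)
mse_target_on_target = mse(coef_target, x_tgt, y_tgt)

print("source-fit coefficients:", coef_source)
print("target-fit coefficients:", coef_target)
print("target MSE, source predictor:", mse_source_on_target)
print("target MSE, oracle predictor:", mse_target_on_target)
```

The nonzero feature mean is deliberate. For independent zero-mean features, zero-filling leaves the population least-squares slopes unchanged, so the shift would not be visible in this linear setup.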