While machine learning models rapidly advance the state of the art on various real-world tasks, out-of-domain (OOD) generalization remains a challenging problem given the vulnerability of these models to spurious correlations. We propose a balanced mini-batch sampling strategy that transforms a biased data distribution into a spurious-free balanced distribution, based on the invariance of the underlying causal mechanisms of the data generation process. We argue that a Bayes optimal classifier trained on such a balanced distribution is minimax optimal across a sufficiently diverse environment space. We also provide an identifiability guarantee for the latent variable model of the proposed data generation process, given sufficiently many training environments. Experiments conducted on DomainBed demonstrate empirically that our method achieves the best performance among the 20 baselines reported on the benchmark.