Machine learning models rapidly advance the state of the art on various real-world tasks, yet out-of-domain (OOD) generalization remains a challenging problem given the vulnerability of these models to spurious correlations. While current domain generalization methods usually focus on enforcing certain invariance properties across different domains through new loss function designs, we propose a balanced mini-batch sampling strategy to reduce the domain-specific spurious correlations in the observed training distributions. More specifically, we propose a two-phased method that 1) identifies the source of spurious correlations, and 2) builds balanced mini-batches free from spurious correlations by matching on the identified source. We provide an identifiability guarantee for the source of spuriousness and show that our proposed approach provably samples from a balanced, spurious-free distribution over all training environments. Experiments on three computer vision datasets with documented spurious correlations demonstrate empirically that our balanced mini-batch sampling strategy improves the performance of four established domain generalization baselines compared with random mini-batch sampling.
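To make the second phase concrete, below is a minimal Python sketch of what matching-based balanced sampling could look like, assuming the spurious attribute of each example has already been identified in phase 1. The function `balanced_minibatches`, its arguments, and the grouping scheme are illustrative assumptions, not the paper's implementation.

```python
import random
from collections import defaultdict

def balanced_minibatches(examples, label_fn, spurious_fn, batch_size, seed=0):
    """Yield mini-batches that draw equally from every
    (class label, spurious attribute) group, so that within each batch
    the label is independent of the identified spurious source.

    `label_fn` and `spurious_fn` extract the class label and the
    (phase-1) spurious attribute from an example. Hypothetical helper,
    not the authors' code.
    """
    rng = random.Random(seed)

    # Group examples by (label, spurious attribute); drawing evenly from
    # these groups breaks the label-attribute correlation within a batch.
    groups = defaultdict(list)
    for ex in examples:
        groups[(label_fn(ex), spurious_fn(ex))].append(ex)

    for g in groups.values():
        rng.shuffle(g)

    per_group = max(1, batch_size // len(groups))
    n_batches = min(len(g) for g in groups.values()) // per_group

    for b in range(n_batches):
        batch = []
        for g in groups.values():
            batch.extend(g[b * per_group:(b + 1) * per_group])
        rng.shuffle(batch)
        yield batch


# Toy usage: (image_id, label, background) triples, where the background
# plays the role of the identified spurious attribute.
data = [(i, i % 2, (i // 2) % 2) for i in range(64)]
for batch in balanced_minibatches(data, lambda x: x[1], lambda x: x[2], batch_size=8):
    print(batch)
    break
```

Drawing the same number of examples from every (label, attribute) group makes the label and the spurious attribute statistically independent within each batch, which is the balancing property the abstract refers to.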