Many datasets are biased, i.e., they contain easy-to-learn features that are highly correlated with the target class only in the dataset, but not in the true underlying distribution of the data. For this reason, learning unbiased models from biased data has become an active research topic in recent years. In this work, we tackle the problem of learning representations that are robust to biases. We first present a margin-based theoretical framework that clarifies why recent contrastive losses (InfoNCE, SupCon, etc.) can fail when dealing with biased data. Based on this framework, we derive a novel formulation of the supervised contrastive loss (epsilon-SupInfoNCE), which provides more precise control of the minimal distance (margin) between positive and negative samples. Furthermore, leveraging the same theoretical framework, we propose FairKL, a new debiasing regularization loss that works well even with extremely biased data. We validate the proposed losses on standard vision datasets, including CIFAR10, CIFAR100, and ImageNet, and assess the debiasing capability of FairKL combined with epsilon-SupInfoNCE, reaching state-of-the-art performance on a number of biased datasets, including real instances of bias in the wild.
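To make the margin formulation concrete, here is a minimal PyTorch sketch of an epsilon-margin supervised InfoNCE objective in the spirit of epsilon-SupInfoNCE. The function name `eps_sup_infonce`, the default hyperparameters, and the exact placement of the margin (added to the negative similarities inside the log-sum-exp) are illustrative assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn.functional as F

def eps_sup_infonce(features: torch.Tensor,
                    labels: torch.Tensor,
                    eps: float = 0.1,
                    temperature: float = 0.1) -> torch.Tensor:
    """Sketch of an epsilon-margin supervised InfoNCE loss.

    features: (B, D) embeddings from the encoder's projection head.
    labels:   (B,)   integer class labels.
    """
    z = F.normalize(features, dim=1)
    sim = z @ z.t() / temperature  # (B, B) scaled cosine similarities
    b = sim.size(0)

    self_mask = torch.eye(b, dtype=torch.bool, device=sim.device)
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    neg_mask = labels[:, None] != labels[None, :]

    # Standard numerical-stability trick: subtracting the row-wise max
    # cancels out in the ratio below.
    sim = sim - sim.max(dim=1, keepdim=True).values.detach()

    # Margin: every negative similarity is shifted up by eps, so the loss
    # is only small when s(anchor, positive) exceeds s(anchor, negative)
    # by at least eps.
    neg_sum = (torch.exp(sim + eps) * neg_mask).sum(dim=1, keepdim=True)

    # -log( exp(s_p) / (exp(s_p) + sum_n exp(s_n + eps)) ) per positive pair
    log_prob = sim - torch.log(torch.exp(sim) + neg_sum)

    n_pos = pos_mask.sum().clamp(min=1)
    return -(log_prob * pos_mask).sum() / n_pos
```

With eps=0 this reduces to a supervised InfoNCE in which, unlike SupCon, the other positives are excluded from the denominator; increasing eps enforces a larger gap between positive and negative similarities.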
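The FairKL regularizer penalizes mismatch between the similarity distributions of bias-aligned samples (same bias attribute as the anchor) and bias-conflicting samples (different bias attribute). Below is a hedged sketch under a Gaussian assumption that matches the first two moments of each group; the helper names, the KL direction, and the grouping of samples are illustrative assumptions rather than the paper's exact regularizer.

```python
import torch

def gaussian_kl(mu_p: torch.Tensor, var_p: torch.Tensor,
                mu_q: torch.Tensor, var_q: torch.Tensor) -> torch.Tensor:
    """KL( N(mu_p, var_p) || N(mu_q, var_q) ) for univariate Gaussians."""
    return 0.5 * (torch.log(var_q / var_p)
                  + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

def fairkl_regularizer(sim_conflicting: torch.Tensor,
                       sim_aligned: torch.Tensor,
                       eps: float = 1e-6) -> torch.Tensor:
    """Sketch of a FairKL-style debiasing penalty.

    sim_conflicting: anchor similarities to bias-conflicting samples.
    sim_aligned:     anchor similarities to bias-aligned samples.

    Fits a Gaussian to each set of similarities (first two moments) and
    penalizes their KL divergence, pushing the two distributions to match
    so the representation cannot separate samples by the bias attribute.
    """
    mu_c = sim_conflicting.mean()
    var_c = sim_conflicting.var(unbiased=False) + eps
    mu_a = sim_aligned.mean()
    var_a = sim_aligned.var(unbiased=False) + eps
    return gaussian_kl(mu_c, var_c, mu_a, var_a)
```

In training, a penalty of this form would be added, with a weighting coefficient, to the contrastive objective sketched above.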