Many datasets are biased, in the sense that they contain easy-to-learn features that are highly correlated with the target class in the dataset but not in the true underlying distribution of the data. For this reason, learning unbiased models from biased data has become a highly relevant research topic in recent years. In this work, we tackle the problem of learning representations that are robust to biases. We first present a margin-based theoretical framework that allows us to clarify why recent contrastive losses (InfoNCE, SupCon, etc.) can fail when dealing with biased data. Building on this framework, we derive a novel formulation of the supervised contrastive loss (epsilon-SupInfoNCE) that provides more accurate control of the minimal distance between positive and negative samples. Furthermore, thanks to our theoretical framework, we also propose FairKL, a new debiasing regularization loss that works well even with extremely biased data. We validate the proposed losses on standard vision datasets, including CIFAR10, CIFAR100, and ImageNet, and we assess the debiasing capability of FairKL combined with epsilon-SupInfoNCE, reaching state-of-the-art performance on a number of biased datasets, including real-world instances of bias.
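To make the margin idea concrete, below is a minimal PyTorch sketch of a supervised contrastive loss with an explicit margin epsilon on the positive-negative similarity gap, in the spirit of epsilon-SupInfoNCE. This is an illustration under stated assumptions, not the paper's exact formulation: the function name, the temperature handling, and the exact placement of epsilon (added to negative similarities before the log-softmax, so that each positive must beat every negative by at least epsilon) are our own choices here.

```python
import torch
import torch.nn.functional as F


def eps_supcon_sketch(z, labels, eps=0.1, tau=0.1):
    """Hedged sketch of a margin-based supervised contrastive loss.

    z:      (N, D) L2-normalized embeddings
    labels: (N,) integer class labels
    eps:    margin enforcing s(anchor, pos) >= s(anchor, neg) + eps
    tau:    softmax temperature
    """
    sim = z @ z.t()  # cosine similarities, since z is assumed normalized
    n = sim.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~eye
    neg_mask = ~pos_mask & ~eye

    loss, count = z.new_zeros(()), 0
    for i in range(n):
        pos = sim[i][pos_mask[i]]
        neg = sim[i][neg_mask[i]]
        if pos.numel() == 0 or neg.numel() == 0:
            continue
        # One log-softmax per positive over {that positive} + all negatives,
        # with negatives shifted up by eps so the positive must win by a margin.
        logits = torch.cat(
            [pos.unsqueeze(1), (neg + eps).unsqueeze(0).expand(pos.size(0), -1)],
            dim=1,
        ) / tau
        target = torch.zeros(pos.size(0), dtype=torch.long, device=z.device)
        loss = loss + F.cross_entropy(logits, target)
        count += 1
    return loss / max(count, 1)
```

With eps = 0 this reduces to a standard supervised InfoNCE-style objective; increasing eps tightens the required separation between positives and negatives, which is the kind of explicit control over the minimal positive-negative distance described above.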