Many machine learning algorithms are trained and evaluated by splitting data from a single source into training and test sets. While such a focus on in-distribution learning scenarios has led to interesting advances, it does not reveal whether models rely on dataset biases as shortcuts for successful prediction (e.g., using snow cues for recognising snowmobiles), resulting in biased models that fail to generalise when the bias shifts to a different class. The cross-bias generalisation problem has been addressed by de-biasing training data through augmentation or re-sampling, which are often prohibitive due to the data collection cost (e.g., collecting images of a snowmobile in a desert) and the difficulty of quantifying or expressing biases in the first place. In this work, we propose a novel framework to train a de-biased representation by encouraging it to be different from a set of representations that are biased by design. This tactic is feasible in many scenarios where it is much easier to define a set of biased representations than to define and quantify bias. We demonstrate the efficacy of our method across a variety of synthetic and real-world biases; our experiments show that the method discourages models from taking bias shortcuts, resulting in improved generalisation. Source code is available at https://github.com/clovaai/rebias.
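The abstract describes the idea only at a high level; the following PyTorch fragment is a minimal, hypothetical sketch of one way to "encourage a representation to be different from a set of representations that are biased by design": a bias-capturing network `g` is restricted by construction (here, 1x1 convolutions, so it can only exploit local texture), and the main network `f` is penalised, here with an RBF-kernel HSIC term, whenever its features are statistically dependent on `g`'s. The module names, the one-sided (non-adversarial) penalty, and all hyper-parameters are illustrative assumptions rather than the authors' exact implementation; see the linked repository for that.

```python
# Minimal sketch (assumptions, not the paper's exact method): train a
# de-biased network f jointly with a bias-capturing network g, and penalise
# statistical dependence between their representations.
import torch
import torch.nn as nn
import torch.nn.functional as F


def rbf_kernel(z, sigma=1.0):
    # Pairwise squared distances -> Gaussian (RBF) kernel matrix.
    sq = torch.cdist(z, z) ** 2
    return torch.exp(-sq / (2 * sigma ** 2))


def hsic(x, y, sigma=1.0):
    # Biased empirical HSIC estimate between two batches of features;
    # values near zero indicate the two representations are close to independent.
    n = x.size(0)
    k = rbf_kernel(x, sigma)
    l = rbf_kernel(y, sigma)
    h = torch.eye(n, device=x.device) - torch.ones(n, n, device=x.device) / n
    return torch.trace(k @ h @ l @ h) / ((n - 1) ** 2)


class SmallReceptiveFieldCNN(nn.Module):
    """A representation that is biased by design (hypothetical example):
    1x1 convolutions only, so it can use local texture but not global shape."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=1), nn.ReLU(),
            nn.Conv2d(dim, dim, kernel_size=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )

    def forward(self, x):
        return self.net(x)


def debias_step(f, g, head_f, head_g, images, labels, lam=1.0):
    # One joint training step: both feature extractors are trained to classify,
    # while f additionally pays a penalty for being dependent on the biased g.
    feat_f, feat_g = f(images), g(images)
    loss_cls = (F.cross_entropy(head_f(feat_f), labels)
                + F.cross_entropy(head_g(feat_g), labels))
    # Simplification: g's features are detached, so only f is pushed away from
    # the bias (the full framework could instead train g adversarially).
    loss_indep = hsic(feat_f, feat_g.detach())
    return loss_cls + lam * loss_indep
```

In practice this objective would be optimised over mini-batches with any feature extractor `f` and linear heads `head_f`, `head_g`; detaching `g`'s features keeps the sketch short, whereas an adversarial variant would also update `g` to maximise the dependence term.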