Dataset bias is a critical challenge in machine learning, and its negative impact is aggravated when models learn unintended decision rules based on spurious correlations. Existing works often handle this issue with human supervision, but obtaining the proper annotations is impractical or even unrealistic. To better tackle this challenge, we propose a simple but effective unsupervised debiasing technique. Specifically, we perform clustering in the feature embedding space and use the clustering results to identify pseudo-attributes, without any explicit attribute supervision. Then, we employ a novel cluster-based reweighting scheme for learning debiased representations; this prevents minority groups from being discounted in order to minimize the overall loss, which is desirable for worst-case generalization. Extensive experiments demonstrate the strong performance of our approach on multiple standard benchmarks, where it is even competitive with its supervised counterpart.
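The two steps described in the abstract, clustering embeddings into pseudo-attributes and reweighting the loss by cluster, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the helper names (`kmeans`, `cluster_weights`, `weighted_loss`), the use of plain k-means with farthest-point initialization, the toy 2-D embeddings, and in particular the weighting rule (inverse cluster size, normalized to mean 1), which stands in for the paper's own scheme.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def init_centers(points, k):
    """Deterministic farthest-point initialization (an illustrative choice)."""
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))
    return centers

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means; returns one cluster id (pseudo-attribute) per point."""
    centers = init_centers(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        assign = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def cluster_weights(assign):
    """Per-sample weights inversely proportional to cluster size (mean weight is 1)."""
    sizes = {}
    for a in assign:
        sizes[a] = sizes.get(a, 0) + 1
    n, m = len(assign), len(sizes)
    return [n / (m * sizes[a]) for a in assign]

def weighted_loss(losses, weights):
    """Average of per-sample losses after cluster-based reweighting."""
    return sum(l * w for l, w in zip(losses, weights)) / len(losses)

# Toy embeddings: eight majority samples near the origin, two minority samples far away.
embeddings = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
              (0.05, 0.05), (0.2, 0.0), (0.0, 0.2), (0.2, 0.2),
              (5.0, 5.0), (5.1, 5.0)]
# Hypothetical per-sample losses: the minority samples are fit poorly.
losses = [0.1] * 8 + [1.0] * 2

pseudo_attr = kmeans(embeddings, k=2)
weights = cluster_weights(pseudo_attr)
plain = sum(losses) / len(losses)
reweighted = weighted_loss(losses, weights)
```

In this toy setting each minority sample receives four times the weight of a majority sample, so the reweighted objective is dominated by the poorly fit minority cluster rather than by the well-fit majority. In practice the embeddings would come from the model being trained, and the weights would multiply the per-sample losses before backpropagation.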