Learning invariant representations is an important requirement when training machine learning models that would otherwise be driven by spurious correlations in their datasets. These spurious correlations between input samples and target labels wrongly direct neural network predictions, resulting in poor performance on certain groups, especially minority groups. Robust training against these spurious correlations requires knowledge of group membership for every sample. Such a requirement is impractical in situations where labeling minority or rare groups is significantly laborious, or where the individuals comprising the dataset choose to conceal sensitive information. On the other hand, when such data collection efforts do take place, they typically yield datasets with only partially labeled group information. Recent work has tackled the fully unsupervised scenario, where no group labels are available. We aim to fill the gap in the literature by tackling the more realistic setting in which partially available sensitive or group information can be leveraged during training. First, we construct a constraint set and derive a high-probability bound for the group assignment to belong to this set. Second, we propose an algorithm that optimizes for the worst-off group assignment in the constraint set. Through experiments on image and tabular datasets, we show improvements in minority-group performance while preserving overall aggregate accuracy across groups.
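To make the two-step idea concrete, below is a minimal, self-contained sketch of worst-case optimization over group assignments when group labels are only partially observed. It is not the paper's algorithm: the constraint set here is a naive enumeration of random completions of the unknown labels (rather than the derived high-probability set), the toy data, the candidate count, and names such as `candidate_assignments` are illustrative assumptions, and the finite-difference gradient is used only to keep the example dependency-free.

```python
# Hypothetical sketch: minimize the worst-case (over plausible group assignments)
# worst-group loss when only some samples carry group labels.
import numpy as np

rng = np.random.default_rng(0)

# Toy data: binary labels, 2 groups, linear logistic model.
n, d, n_groups = 200, 5, 2
X = rng.normal(size=(n, d))
y = (X[:, 0] + 0.1 * rng.normal(size=n) > 0).astype(float)
true_groups = (X[:, 1] > 0).astype(int)
known_mask = rng.random(n) < 0.3                      # ~30% of samples have group labels
obs_groups = np.where(known_mask, true_groups, -1)    # -1 marks unknown group membership

def loss_per_sample(w):
    """Per-sample logistic loss of a linear model with weights w."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return -(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))

def worst_group_loss(w, assignment):
    """Largest average loss over groups under a full group assignment."""
    ls = loss_per_sample(w)
    return max(ls[assignment == g].mean() for g in range(n_groups))

def sample_assignment():
    """Complete the unknown group labels with a draw from a uniform prior."""
    a = obs_groups.copy()
    a[a == -1] = rng.integers(0, n_groups, size=(a == -1).sum())
    return a

# Stand-in for the constraint set: a handful of candidate completions.
candidate_assignments = [sample_assignment() for _ in range(8)]

def objective(w):
    # Worst-off group assignment within the candidate set.
    return max(worst_group_loss(w, a) for a in candidate_assignments)

# Outer minimization over model weights via a simple finite-difference gradient step.
w, lr, eps = np.zeros(d), 0.5, 1e-4
for _ in range(200):
    grad = np.array([(objective(w + eps * e) - objective(w - eps * e)) / (2 * eps)
                     for e in np.eye(d)])
    w -= lr * grad

print("final worst-case worst-group loss:", objective(w))
```

In this sketch the inner maximization is a plain `max` over a finite candidate set; the paper's constraint set and the bound guaranteeing it contains the true assignment with high probability would replace that enumeration.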