Imbalanced datasets are commonplace in modern machine learning problems. The presence of under-represented classes or groups with sensitive attributes raises concerns about generalization and fairness. Such concerns are further exacerbated by the fact that large-capacity deep nets can perfectly fit the training data and appear to achieve perfect accuracy and fairness during training, but perform poorly at test time. To address these challenges, we propose AutoBalance, a bi-level optimization framework that automatically designs a training loss function to optimize a blend of accuracy and fairness-seeking objectives. Specifically, a lower-level problem trains the model weights, and an upper-level problem tunes the loss function by monitoring and optimizing the desired objective over the validation data. Our loss design enables personalized treatment for classes/groups by employing a parametric cross-entropy loss and individualized data augmentation schemes. We evaluate the benefits and performance of our approach for the application scenarios of imbalanced and group-sensitive classification. Extensive empirical evaluations demonstrate the benefits of AutoBalance over state-of-the-art approaches. Our experimental findings are complemented with theoretical insights on loss function design and the benefits of the train-validation split. All code is available open-source.
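To make the bi-level structure concrete, the following is a minimal, self-contained sketch (not the paper's implementation): the lower level fits a linear softmax classifier on an imbalanced training split under a parametric cross-entropy loss with per-class additive logit adjustments, and the upper level selects the adjustment that maximizes balanced accuracy on a held-out validation split. For simplicity the sketch uses a toy synthetic dataset and replaces gradient-based upper-level updates with a grid search; all names (`make_data`, `train`, `balanced_acc`, `delta`) are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced binary dataset: class 0 is 9x more frequent in training.
def make_data(n0, n1):
    X0 = rng.normal([0.0, 0.0], 1.0, size=(n0, 2))
    X1 = rng.normal([2.0, 2.0], 1.0, size=(n1, 2))
    X = np.vstack([X0, X1])
    X = np.hstack([X, np.ones((len(X), 1))])  # bias column
    y = np.array([0] * n0 + [1] * n1)
    return X, y

X_tr, y_tr = make_data(900, 100)    # imbalanced training split
X_val, y_val = make_data(100, 100)  # balanced validation split

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train(delta, steps=200, lr=0.5):
    """Lower level: fit weights under a logit-adjusted cross-entropy loss.

    `delta` additively shifts the per-class logits inside the training
    loss only; boosting the majority-class logit during training forces
    the model to learn larger margins for the minority class.
    """
    W = np.zeros((3, 2))
    onehot = np.eye(2)[y_tr]
    for _ in range(steps):
        p = softmax(X_tr @ W + delta)  # parametric CE: adjusted logits
        W -= lr * X_tr.T @ (p - onehot) / len(y_tr)
    return W

def balanced_acc(W):
    """Upper-level objective: mean per-class accuracy on validation data."""
    pred = (X_val @ W).argmax(axis=1)
    return float(np.mean([(pred[y_val == c] == c).mean() for c in (0, 1)]))

# Upper level: pick the per-class adjustment that maximizes balanced
# validation accuracy (grid search stands in for hypergradient descent).
candidates = [np.array([d, 0.0]) for d in np.linspace(0.0, 3.0, 7)]
best_delta = max(candidates, key=lambda d: balanced_acc(train(d)))
```

Because the candidate grid contains the zero adjustment (plain cross-entropy), the selected `best_delta` can only match or improve the balanced validation accuracy of the unadjusted baseline; the actual method additionally tunes multiplicative adjustments and augmentation choices via gradient-based updates.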