Evolved Sample Weights for Bias Mitigation: Effectiveness Depends on Optimization Objectives

Machine learning models trained on real-world data may inadvertently make biased predictions that negatively impact marginalized communities. Reweighting is a method that can mitigate such bias in model predictions by assigning a weight to each data point used during model training. In this paper, we compare three methods for generating these weights: (1) evolving them using a Genetic Algorithm (GA), (2) computing them using only dataset characteristics, and (3) assigning equal weights to all data points. Model performance under each strategy was evaluated using paired predictive and fairness metrics, which also served as optimization objectives for the GA during evolution. Specifically, we used two predictive metrics (accuracy and area under the Receiver Operating Characteristic curve) and two fairness metrics (demographic parity difference and subgroup false negative fairness). Using experiments on eleven publicly available datasets (including two medical datasets), we show that evolved sample weights can produce models that achieve better trade-offs between fairness and predictive performance than alternative weighting methods. However, the magnitude of these benefits depends strongly on the choice of optimization objectives. Our experiments reveal that optimizing with accuracy and demographic parity difference metrics yields the largest number of datasets for which evolved weights are significantly better than other weighting strategies in optimizing both objectives.
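To make two of the ingredients above concrete, here is a minimal sketch of (a) weights computed from dataset characteristics alone and (b) the demographic parity difference metric. The abstract does not say which dataset-based formula the authors use. The sketch assumes the common reweighing scheme that sets each weight to P(group) · P(label) / P(group, label), and the function names and toy data are hypothetical.

```python
import numpy as np

def reweighing_weights(y, group):
    """Assumed scheme: w(g, c) = P(g) * P(c) / P(g, c), from label and group counts only."""
    y = np.asarray(y)
    group = np.asarray(group)
    weights = np.ones(len(y), dtype=float)  # equal weights are the baseline
    for g in np.unique(group):
        for c in np.unique(y):
            mask = (group == g) & (y == c)
            if mask.any():
                expected = (group == g).mean() * (y == c).mean()
                observed = mask.mean()
                weights[mask] = expected / observed
    return weights

def demographic_parity_difference(y_pred, group):
    """Largest gap in positive-prediction rate between any two groups (0 is best)."""
    y_pred = np.asarray(y_pred)
    group = np.asarray(group)
    rates = [y_pred[group == g].mean() for g in np.unique(group)]
    return max(rates) - min(rates)

# Hypothetical toy data: group 0 is mostly labeled positive, group 1 mostly negative.
y = np.array([1, 1, 1, 0, 1, 0, 0, 0])
group = np.array([0, 0, 0, 0, 1, 1, 1, 1])

w = reweighing_weights(y, group)
print(w)                                        # under-represented (group, label) pairs get larger weights
print(demographic_parity_difference(y, group))  # gap if a model simply reproduced the labels
```

Such a weight vector would typically be passed to a learner that accepts per-sample weights, for example through the `sample_weight` argument that many scikit-learn estimators take in `fit`. In the evolutionary setting described above, a GA would instead treat the whole weight vector as a candidate solution and score it on a predictive metric paired with a fairness metric; that search loop is omitted here.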

