The feature subset selection problem aims at selecting the relevant subset of features to improve the performance of a Machine Learning (ML) algorithm on training data. Some features in data can be inherently noisy, costly to compute, improperly scaled, or correlated with other features, and they can adversely affect the accuracy, cost, and complexity of the induced algorithm. The goal of traditional feature selection approaches has been to remove such irrelevant features. In recent years, ML has been making a noticeable impact on the decision-making processes of our everyday lives. We want to ensure that these decisions do not reflect biased behavior towards certain groups or individuals based on protected attributes such as age, sex, or race. In this paper, we present a feature subset selection approach that jointly optimizes fairness and accuracy objectives and computes Pareto-optimal solutions using the NSGA-II algorithm. We use statistical disparity as the fairness metric and the F1-score as the model-performance metric. Our experiments on the most commonly used fairness benchmark datasets with three different machine learning algorithms show that the evolutionary algorithm can effectively explore the trade-off between fairness and accuracy.
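To make the setup concrete, the following is a minimal sketch (not the authors' code) of what NSGA-II-based fair feature selection can look like: each individual is a binary feature mask, and the two minimized objectives are (1 - F1-score) and statistical disparity. It assumes the `pymoo` and `scikit-learn` libraries, synthetic placeholder data, and logistic regression as a stand-in classifier; the paper's actual datasets, classifiers, and hyperparameters are not specified here.

```python
# Sketch of Pareto-optimal feature subset selection with NSGA-II.
# Assumptions: pymoo >= 0.6, scikit-learn; data and classifier are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from pymoo.algorithms.moo.nsga2 import NSGA2
from pymoo.core.problem import ElementwiseProblem
from pymoo.operators.sampling.rnd import BinaryRandomSampling
from pymoo.operators.crossover.pntx import TwoPointCrossover
from pymoo.operators.mutation.bitflip import BitflipMutation
from pymoo.optimize import minimize


def statistical_disparity(y_pred, protected):
    """|P(y_hat = 1 | protected = 0) - P(y_hat = 1 | protected = 1)|."""
    return abs(y_pred[protected == 0].mean() - y_pred[protected == 1].mean())


class FairFeatureSelection(ElementwiseProblem):
    """Binary mask over features; two minimized objectives: (1 - F1), disparity."""

    def __init__(self, X_tr, X_te, y_tr, y_te, prot_te):
        super().__init__(n_var=X_tr.shape[1], n_obj=2, xl=0, xu=1, vtype=bool)
        self.X_tr, self.X_te = X_tr, X_te
        self.y_tr, self.y_te = y_tr, y_te
        self.prot_te = prot_te

    def _evaluate(self, mask, out, *args, **kwargs):
        if not mask.any():  # empty subset: assign worst values for both objectives
            out["F"] = [1.0, 1.0]
            return
        clf = LogisticRegression(max_iter=1000)
        clf.fit(self.X_tr[:, mask], self.y_tr)
        y_pred = clf.predict(self.X_te[:, mask])
        out["F"] = [1.0 - f1_score(self.y_te, y_pred),
                    statistical_disparity(y_pred, self.prot_te)]


# Hypothetical data: X (features), y (labels), prot (binary protected attribute).
rng = np.random.default_rng(0)
X, y, prot = rng.normal(size=(500, 20)), rng.integers(0, 2, 500), rng.integers(0, 2, 500)
X_tr, X_te, y_tr, y_te, _, prot_te = train_test_split(X, y, prot, random_state=0)

problem = FairFeatureSelection(X_tr, X_te, y_tr, y_te, prot_te)
algorithm = NSGA2(pop_size=50,
                  sampling=BinaryRandomSampling(),
                  crossover=TwoPointCrossover(),
                  mutation=BitflipMutation(),
                  eliminate_duplicates=True)
res = minimize(problem, algorithm, ("n_gen", 30), seed=1, verbose=False)
# res.X holds the Pareto-optimal feature masks; res.F the corresponding
# [(1 - F1), disparity] pairs, i.e. the fairness-accuracy trade-off front.
```

Because NSGA-II returns a whole non-dominated front rather than a single model, a practitioner can pick the feature subset whose fairness-accuracy trade-off best matches the application's constraints.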