Machine learning (ML) is playing an increasingly important role in rendering decisions that affect a broad range of groups in society. ML models inform decisions in criminal justice, the extension of credit in banking, and the hiring practices of corporations. This gives rise to the requirement of model fairness, which holds that automated decisions should be equitable with respect to protected features (e.g., gender, race, or age) that are often under-represented in the data. We postulate that this problem of under-representation is a corollary of the problem of imbalanced data learning. Such imbalance is often reflected in both the classes and the protected features. For example, one class (those receiving credit) may be over-represented with respect to another class (those not receiving credit), and a particular group (females) may be under-represented with respect to another group (males). A key element in achieving algorithmic fairness with respect to protected groups is the simultaneous reduction of class and protected group imbalance in the underlying training data, which facilitates increases in both model accuracy and fairness. We discuss the importance of bridging imbalanced learning and group fairness by showing how key concepts in these fields overlap and complement each other, and we propose a novel oversampling algorithm, Fair Oversampling, that addresses both skewed class distributions and protected features. Our method: (i) can be used as an efficient pre-processing algorithm for standard ML algorithms to jointly address imbalance and group equity; and (ii) can be combined with fairness-aware learning algorithms to improve their robustness to varying levels of class imbalance. Additionally, we take a step toward bridging the gap between fairness and imbalanced learning with a new metric, Fair Utility, that combines balanced accuracy with fairness.
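The abstract does not specify how Fair Oversampling rebalances the data or how Fair Utility combines balanced accuracy with fairness, so the sketch below is only an illustration of the general idea, not the paper's method. It assumes naive random oversampling over joint (class, protected-group) cells and a hypothetical score that penalises balanced accuracy by the demographic parity gap; the names `joint_oversample`, `fair_utility`, and the weight `lam` are invented for this example.

```python
# Illustrative sketch only: naive joint rebalancing over (class, group) cells
# and a hypothetical accuracy-plus-fairness score. Not the paper's algorithm.
import numpy as np
from sklearn.metrics import balanced_accuracy_score


def joint_oversample(X, y, g, rng=None):
    """Randomly oversample every (class, group) cell to the largest cell size.

    X: 2D feature array, y: class labels, g: protected-attribute values
    (all NumPy arrays). Random oversampling with replacement is used here
    purely as a stand-in for a fairness-aware oversampler.
    """
    rng = np.random.default_rng(rng)
    cells = {}
    for c in np.unique(y):
        for a in np.unique(g):
            idx = np.where((y == c) & (g == a))[0]
            if idx.size > 0:
                cells[(c, a)] = idx
    target = max(idx.size for idx in cells.values())
    resampled = []
    for idx in cells.values():
        extra = rng.choice(idx, size=target - idx.size, replace=True)
        resampled.append(np.concatenate([idx, extra]))
    sel = np.concatenate(resampled)
    return X[sel], y[sel], g[sel]


def fair_utility(y_true, y_pred, g, lam=0.5):
    """Hypothetical combined score: balanced accuracy minus lam times the
    demographic parity gap (difference in positive prediction rates across
    protected groups), assuming binary 0/1 predictions."""
    bacc = balanced_accuracy_score(y_true, y_pred)
    rates = [np.mean(y_pred[g == a]) for a in np.unique(g)]
    dp_gap = max(rates) - min(rates)
    return bacc - lam * dp_gap
```

Under these assumptions, `joint_oversample` would be applied to the training split before fitting any standard classifier, and `fair_utility` would be computed on held-out predictions; the actual definitions used in the paper may differ.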