The increasing number of applications of Artificial Intelligence (AI) has led researchers to study the social impact of these technologies and to evaluate their fairness. Unfortunately, current fairness metrics are hard to apply to multi-class, multi-demographic classification problems such as Facial Expression Recognition (FER). We propose a new set of metrics for these problems. Of the three metrics proposed, two focus on the representational and stereotypical bias of the dataset, and the third on the residual bias of the trained model. In combination, these metrics can be used to study and compare diverse bias-mitigation methods. We demonstrate their usefulness by applying them to a FER problem based on the popular Affectnet dataset. Like many other FER datasets, Affectnet is a large Internet-sourced dataset, with 291,651 labeled images. Obtaining images from the Internet raises concerns about the fairness of any system trained on this data and about its ability to generalize properly to diverse populations. We first analyze the dataset and some of its variants, finding substantial racial bias and gender stereotypes. We then extract several subsets with different demographic properties and train a model on each one, observing the amount of residual bias in the different setups. Finally, we provide a second analysis on a different dataset, FER+.
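The abstract does not define the proposed metrics, but the notion of representational bias it mentions can be illustrated with a generic measure. The sketch below computes the normalized Shannon evenness of demographic-group counts in a dataset; the function name, the toy group labels, and the choice of evenness as the measure are all illustrative assumptions, not the paper's actual metric.

```python
import math
from collections import Counter

def representational_evenness(groups):
    """Normalized Shannon evenness of demographic-group counts.

    Returns 1.0 when every group is equally represented and values
    near 0.0 when one group dominates. This is a generic illustration
    of a representational-bias measure, not the metric from the paper.
    """
    counts = Counter(groups)
    n = sum(counts.values())
    k = len(counts)
    if k <= 1:
        # With zero or one group there is no diversity to measure.
        return 0.0
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(k)  # divide by max entropy to normalize

# Toy example: a racially skewed sample of per-image group labels.
sample = ["white"] * 80 + ["black"] * 10 + ["asian"] * 10
print(round(representational_evenness(sample), 3))  # well below 1.0
```

A balanced dataset (equal counts per group) scores exactly 1.0, so the gap from 1.0 gives a simple, scale-free summary of how skewed the demographic representation is.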