Can We Achieve Fairness Using Semi-Supervised Learning?

Ethical bias in machine learning models has become a matter of concern in the software engineering community. Most prior software engineering work concentrated on finding ethical bias in models rather than fixing it. After finding bias, the next step is mitigation. Prior researchers mainly used supervised approaches to achieve fairness. However, in the real world, data with trustworthy ground truth is hard to obtain, and the ground truth itself can contain human bias. Semi-supervised learning is a machine learning technique in which a small set of labeled data is used, incrementally, to generate pseudo-labels for the rest of the data; all of that data is then used for model training. In this work, we apply four popular semi-supervised techniques as pseudo-labelers to create fair classification models. Our framework, Fair-SSL, takes a very small amount (10%) of labeled data as input and generates pseudo-labels for the unlabeled data. We then synthetically generate new data points to balance the training data based on class and protected attribute, as proposed by Chakraborty et al. in FSE 2021. Finally, the classification model is trained on the balanced pseudo-labeled data and validated on test data. After experimenting on ten datasets and three learners, we find that Fair-SSL performs comparably to three state-of-the-art bias mitigation algorithms. That said, the clear advantage of Fair-SSL is that it requires only 10% of the labeled training data. To the best of our knowledge, this is the first SE work in which semi-supervised techniques are used to fight ethical bias in SE ML models.
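
The pipeline described in the abstract (keep 10% of the labels, pseudo-label the rest, balance by class and protected attribute) can be illustrated with a short Python sketch. This is a minimal illustration under stated assumptions, not the Fair-SSL implementation: the abstract does not name the four pseudo-labelers, so a plain 1-nearest-neighbor rule stands in for them, and random duplication stands in for the synthetic data generation of Chakraborty et al. The toy data, the function names `pseudo_label` and `balance`, and the parameter `protected_index` are all hypothetical.

```python
import random
from collections import defaultdict


def pseudo_label(labeled, unlabeled):
    """Give each unlabeled row the label of its nearest labeled row (1-NN).

    labeled: list of (features, label); unlabeled: list of features.
    Assumption: 1-NN is only a stand-in for a semi-supervised pseudo-labeler.
    """
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    out = []
    for feats in unlabeled:
        _, nearest_label = min(labeled, key=lambda row: dist2(row[0], feats))
        out.append((feats, nearest_label))
    return out


def balance(rows, protected_index, rng):
    """Oversample so every (class, protected value) subgroup has equal size.

    Assumption: random duplication replaces the synthetic generation of new
    data points that the abstract describes.
    """
    groups = defaultdict(list)
    for feats, label in rows:
        groups[(label, feats[protected_index])].append((feats, label))
    target = max(len(members) for members in groups.values())
    balanced = []
    for members in groups.values():
        balanced.extend(members)
        balanced.extend(rng.choice(members) for _ in range(target - len(members)))
    return balanced


rng = random.Random(0)

# Hypothetical toy data: feature 0 is a binary protected attribute,
# feature 1 is a numeric score; the label depends on both.
data = []
for _ in range(200):
    protected = rng.choice([0, 1])
    score = rng.random()
    label = int(score + 0.2 * protected > 0.6)
    data.append(((protected, score), label))

n_labeled = len(data) // 10                      # keep only 10% of the labels
labeled = data[:n_labeled]
unlabeled = [feats for feats, _ in data[n_labeled:]]

# Labeled seed plus pseudo-labeled remainder, then subgroup balancing.
train = labeled + pseudo_label(labeled, unlabeled)
balanced = balance(train, protected_index=0, rng=rng)
```

A classifier would then be trained on `balanced` and evaluated on held-out test data. Balancing on the pair (class, protected value) rather than on class alone is what ties the resampling step to fairness: no subgroup is under-represented for either outcome.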
