Should College Dropout Prediction Models Include Protected Attributes?

Early identification of students at risk of dropping out of college can provide tremendous value for improving student success and institutional effectiveness, and predictive analytics are increasingly used for this purpose. However, ethical concerns have emerged about whether including protected attributes in the prediction models discriminates against underrepresented student groups and exacerbates existing inequities. We examine this issue in the context of a large U.S. research university with both residential and fully online degree-seeking students. Based on comprehensive institutional records for this entire student population across multiple years, we build machine learning models to predict student dropout after one academic year of study, and compare the overall performance and fairness of model predictions with and without four protected attributes (gender, underrepresented minority (URM) status, first-generation status, and high financial need). We find that including protected attributes does not affect overall prediction performance and only marginally improves the algorithmic fairness of predictions. While these findings suggest that including protected attributes is preferred, our analysis also offers guidance on how to evaluate the impact in a local context, where institutional stakeholders seek to leverage predictive analytics to support student success.

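The comparison described in the abstract (overall performance versus fairness, for predictions made with and without protected attributes) can be sketched roughly as follows. This is a minimal illustration, not the authors' method: the abstract does not name the metrics used, so accuracy and a per-group recall gap are assumptions, the function names are hypothetical, and the labels and predictions are hand-made toy values rather than results from the study.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Share of students whose dropout label was predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall_by_group(y_true, y_pred, group):
    """Per-group recall: among actual dropouts in a group, the share flagged."""
    hits = defaultdict(int)
    positives = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, group):
        if t == 1:
            positives[g] += 1
            hits[g] += int(p == 1)
    return {g: hits[g] / positives[g] for g in positives}


def recall_gap(y_true, y_pred, group):
    """Assumed fairness measure: largest difference in recall between groups."""
    rates = recall_by_group(y_true, y_pred, group)
    return max(rates.values()) - min(rates.values())


# Toy data (invented for illustration): 1 = dropped out, 0 = retained.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
group = ["a", "a", "a", "a", "b", "b", "b", "b"]  # one protected attribute

# Hypothetical predictions from two models: one trained without the
# protected attribute as a feature, one trained with it.
pred_without = [1, 1, 0, 0, 1, 0, 0, 1]
pred_with = [1, 0, 0, 0, 1, 0, 0, 0]

for name, pred in [("without", pred_without), ("with", pred_with)]:
    print(name, accuracy(y_true, pred), recall_gap(y_true, pred, group))
```

In a local evaluation, the two prediction lists would come from models fitted to institutional records, and the same pair of numbers (overall performance, between-group gap) would be reported for each protected attribute in turn.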