Colleges and universities use predictive analytics in a variety of ways to increase student success rates. Despite the potential of predictive analytics, two major barriers exist to its adoption in higher education: (a) the lack of democratization in deployment, and (b) the potential to exacerbate inequalities. Education researchers and policymakers encounter numerous challenges in deploying predictive modeling in practice. These challenges arise at different steps of the modeling pipeline, including data preparation, model development, and evaluation. Moreover, each of these steps can introduce additional bias into the system if not performed appropriately. Most large-scale and nationally representative education data sets suffer from a significant number of incomplete responses from research participants. While many education-related studies have addressed the challenges of missing data, little is known about the impact of handling missing values on the fairness of predictive outcomes in practice. In this paper, we set out to first assess the disparities in predictive modeling outcomes for college-student success, and then to investigate the impact of imputation techniques on model performance and fairness using a commonly used set of metrics. We conduct a prospective evaluation to provide a less biased estimate of future performance and fairness than an evaluation on historical data alone. Our comprehensive analysis of a real large-scale education dataset reveals key insights into modeling disparities and into how imputation techniques affect the fairness of student-success predictions under different testing scenarios. Our results indicate that imputation introduces bias if the testing set follows the historical distribution. However, if societal injustice is addressed and the upcoming batch of observations is consequently equalized across groups, the model becomes less biased.
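To make the two steps discussed above concrete, the following is a minimal sketch of mean imputation for missing responses and of one commonly used fairness metric, the demographic parity difference (the gap in positive-prediction rates across groups). The data, group labels, and predictions here are hypothetical toy values for illustration only, not the paper's dataset or its exact metric set.

```python
import numpy as np

# Toy feature matrix with missing responses (NaN), standing in for
# incomplete survey answers in a large-scale education dataset.
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, np.nan],
              [5.0, 6.0]])

# Mean imputation: replace each NaN with its column (feature) mean.
col_means = np.nanmean(X, axis=0)
X_imputed = np.where(np.isnan(X), col_means, X)

def demographic_parity_difference(y_pred, group):
    """Gap between the highest and lowest positive-prediction rate
    across demographic groups; 0 means parity."""
    rates = [y_pred[group == g].mean() for g in np.unique(group)]
    return max(rates) - min(rates)

# Hypothetical binary success predictions for two demographic groups.
group = np.array([0, 0, 1, 1, 1, 1])
y_pred = np.array([1, 1, 1, 0, 0, 1])
gap = demographic_parity_difference(y_pred, group)  # 1.0 - 0.5 = 0.5
```

Evaluating such a metric on both a historically distributed test set and an equalized one is the kind of contrast a prospective evaluation can surface.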