Variables with little variation are common in educational data mining (EDM) and discipline-based education research (DBER) because of the demographics of higher education and the questions we ask. Yet little work has examined how to analyze such data. We therefore conducted a simulation study using logistic regression, penalized regression, and random forest, systematically varying the fraction of positive outcomes, the feature imbalances, and the odds ratios. We find that the algorithms treat features with the same odds ratio differently depending on the features' imbalance and the outcome imbalance. Although none of the algorithms fully solved the problem, penalized approaches such as Firth and Log-F reduced its scale. Our results suggest that EDM and DBER studies might contain false negatives when determining which variables are related to an outcome. We then apply our findings to a graduate admissions data set and propose recommendations for researchers working with the kind of imbalanced data common to EDM and DBER studies.
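The false-negative concern can be illustrated with a minimal sketch. It is not the simulation code used in the study. It assumes a single binary feature, for which the unpenalized logistic-regression estimate of the log odds ratio and its Wald standard error can be read directly off the 2x2 table of counts. The function names, the sample size, the rates, the odds ratio of 2, and the 1.96 cutoff are all hypothetical choices made for illustration. Penalized methods such as Firth or Log-F and random forest are not implemented here.

```python
import math
import random


def simulate_table(n, feature_rate, base_rate, odds_ratio, rng):
    """Draw n cases and return 2x2 counts indexed as counts[x][y].

    feature_rate: P(x = 1), i.e. how imbalanced the binary feature is.
    base_rate:    P(y = 1 | x = 0), which drives the outcome imbalance.
    odds_ratio:   true odds ratio of y for x = 1 versus x = 0.
    """
    base_odds = base_rate / (1.0 - base_rate)
    exposed_odds = base_odds * odds_ratio
    p_by_x = (base_rate, exposed_odds / (1.0 + exposed_odds))
    counts = [[0, 0], [0, 0]]
    for _ in range(n):
        x = 1 if rng.random() < feature_rate else 0
        y = 1 if rng.random() < p_by_x[x] else 0
        counts[x][y] += 1
    return counts


def wald_detects(counts, z_crit=1.96):
    """Return True if the Wald test flags the feature as related to y.

    With one binary predictor, the logistic-regression MLE of the log
    odds ratio is log(a*d / (b*c)) and its standard error is
    sqrt(1/a + 1/b + 1/c + 1/d). An empty cell means the MLE does not
    exist (separation); this sketch counts that as "not detected".
    """
    (a, b), (c, d) = counts
    if min(a, b, c, d) == 0:
        return False
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1.0 / a + 1.0 / b + 1.0 / c + 1.0 / d)
    return abs(log_or / se) > z_crit


def detection_rate(n, feature_rate, base_rate, odds_ratio, reps=300, seed=0):
    """Fraction of simulated data sets in which the feature is detected."""
    rng = random.Random(seed)
    hits = sum(
        wald_detects(simulate_table(n, feature_rate, base_rate, odds_ratio, rng))
        for _ in range(reps)
    )
    return hits / reps


if __name__ == "__main__":
    # Same true odds ratio, different feature imbalance (hypothetical values).
    for rate in (0.5, 0.2, 0.05):
        print(rate, detection_rate(n=500, feature_rate=rate,
                                   base_rate=0.3, odds_ratio=2.0))
```

Under these assumptions, comparing `detection_rate` across values of `feature_rate` while holding `odds_ratio` fixed shows how often a real association would be missed purely because the feature is imbalanced. Varying `base_rate` does the same for outcome imbalance.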