National research evaluation initiatives and incentive schemes have previously chosen between simplistic quantitative indicators and time-consuming peer review, sometimes supported by bibliometrics. Here we assess whether artificial intelligence (AI) could provide a third alternative, estimating article quality from multiple bibliometric and metadata inputs. We investigated this using provisional three-level REF2021 peer review scores for 84,966 articles submitted to the UK Research Excellence Framework 2021, each matching a Scopus record from 2014-18 and having a substantial abstract. We found that accuracy is highest in the medical and physical sciences Units of Assessment (UoAs) and economics, reaching 42% above the baseline (72% overall) in the best case. This result is based on 1,000 bibliometric inputs and half of the articles used for training in each UoA. Prediction accuracies above the baseline for the social science, mathematics, engineering, arts, and humanities UoAs were much lower or close to zero. The Random Forest Classifier (standard or ordinal) and Extreme Gradient Boosting Classifier algorithms performed best of the 32 tested. Accuracy was lower if UoAs were merged or replaced by Scopus broad categories. We increased accuracy with an active learning strategy and by selecting articles with higher prediction probabilities, as estimated by the algorithms, but this substantially reduced the number of scores predicted.
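The final trade-off described above, keeping only predictions whose estimated class probability exceeds a threshold, and thereby scoring fewer articles, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the function name, the three-level probability vectors, and the threshold value are all hypothetical.

```python
def select_confident(probs, threshold=0.8):
    """Keep only predictions the classifier is confident about.

    probs: list of per-article probability vectors over the three
           REF quality levels (hypothetical example data).
    Returns a list of (article_index, predicted_level) pairs for
    articles whose top class probability meets the threshold;
    the rest are left unscored, shrinking coverage as accuracy rises.
    """
    kept = []
    for i, p in enumerate(probs):
        best = max(p)
        if best >= threshold:
            kept.append((i, p.index(best)))
    return kept


# Two articles: only the first clears the 0.8 confidence threshold.
example = [[0.90, 0.05, 0.05], [0.40, 0.35, 0.25]]
print(select_confident(example))  # [(0, 0)]
```

Raising the threshold increases the expected accuracy of the retained predictions but reduces how many articles receive an AI-estimated score, which is the coverage/accuracy trade-off the abstract notes.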