This project aims to optimize the linguistic indexing of the OpenAlex database by comparing the performance of various Python-based language identification procedures on different metadata corpora extracted from a manually annotated article sample. The precision and recall of each algorithm, corpus, and language are analyzed first, followed by an assessment of the processing speeds recorded for each algorithm and corpus type. These performance measures are then simulated at the database level using probabilistic confusion matrices for each algorithm, corpus, and language, together with a probabilistic model of the relative article language frequencies across the whole OpenAlex database. Results show that procedure performance depends strongly on the weight given to each measure: where precision is preferred, the LangID algorithm applied to the Greedy corpus gives the best results; however, whenever recall is considered even slightly more important than precision, or as soon as processing time is given any consideration, applying the FastSpell algorithm to the Titles corpus outperforms all other alternatives. Given the lack of truly multilingual large-scale bibliographic databases, it is hoped that these results will help confirm and foster the unparalleled potential of the OpenAlex database for comprehensive cross-linguistic measurement and evaluation.
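The evaluation pipeline described above can be sketched in plain Python. This is a hedged illustration, not the authors' code: the confusion-matrix counts and language-frequency weights below are invented for the example, and the real study covers many more algorithms, corpora, and languages.

```python
# Illustrative sketch: per-language precision/recall from a confusion matrix,
# then a database-level expectation weighted by assumed relative language
# frequencies. All counts and frequencies here are made up for the example.

def precision_recall(confusion):
    """confusion[true_lang][pred_lang] = count. Returns {lang: (P, R)}."""
    langs = set(confusion) | {p for row in confusion.values() for p in row}
    stats = {}
    for lang in langs:
        tp = confusion.get(lang, {}).get(lang, 0)
        fp = sum(row.get(lang, 0) for t, row in confusion.items() if t != lang)
        fn = sum(c for p, c in confusion.get(lang, {}).items() if p != lang)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        stats[lang] = (precision, recall)
    return stats

# Hypothetical confusion matrix for one algorithm/corpus pair,
# built from a manually annotated sample.
confusion = {
    "en": {"en": 95, "fr": 3, "de": 2},
    "fr": {"fr": 88, "en": 10, "de": 2},
    "de": {"de": 90, "en": 8, "fr": 2},
}
stats = precision_recall(confusion)

# Database-level recall: weight each language's recall by its assumed
# relative frequency in the full database.
freq = {"en": 0.75, "fr": 0.15, "de": 0.10}
db_recall = sum(freq[lang] * stats[lang][1] for lang in freq)
```

The same weighting scheme applied per algorithm and corpus is what lets sample-level measurements be extrapolated to database-level expectations.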