Many solutions have been proposed for extracting bibliographic references from PDF papers. Machine learning, rule-based and regular-expression approaches are among the methods most commonly adopted by the tools that address this task. This work aims to identify and evaluate all and only the tools that, given a full-text paper in PDF format, can recognise, extract and parse its bibliographic references. We identified seven such tools: Anystyle, Cermine, ExCite, Grobid, Pdfssa4met, Scholarcy and Science Parse. We compared and evaluated them against a corpus of 56 PDF articles published in 27 subject areas. Overall, Anystyle obtained the best score, followed by Cermine. However, in some subject areas, other tools achieved better results on specific tasks.