Because so many scientific papers are published every day, reviewing each one manually is impossible. Automatic extraction of key information is therefore desirable. In this paper, we examine STEREO, a tool that uses regular expressions to extract statistics from scientific papers. By adapting an existing regular expression inclusion algorithm to our use case, we reduce the number of regular expressions used in STEREO by about $33.8\%$. From the condensed rule set, we identify common patterns that can guide the creation of new rules. We also apply STEREO, which was previously trained on the life-sciences and medical domain, to a new scientific domain, namely Human-Computer Interaction (HCI), and re-evaluate it. We find that statistics in the HCI domain are similar to those in the medical domain, although a higher percentage of them conform to APA style. Additionally, we compare extraction from PDF and LaTeX source files and find LaTeX to be the more reliable source.
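To make the idea of regex-based extraction concrete, the following minimal sketch matches an APA-style t-test report such as "t(24) = 2.31, p = .028". The pattern `T_TEST` and the helper `extract_t_tests` are hypothetical names introduced here for illustration; they are assumptions and not rules taken from STEREO.

```python
import re

# Hypothetical pattern for an APA-style t-test report, e.g. "t(24) = 2.31, p = .028".
# Illustrative assumption only; this is not one of STEREO's actual rules.
T_TEST = re.compile(
    r"\bt\s*\(\s*(?P<df>\d+(?:\.\d+)?)\s*\)"   # degrees of freedom in parentheses
    r"\s*=\s*(?P<t>-?\d*\.?\d+)"               # test statistic, possibly negative
    r"\s*,\s*p\s*(?P<op>[<>=])\s*"             # comparison operator for the p-value
    r"(?P<p>0?\.\d+)"                          # p-value, with or without leading zero
)


def extract_t_tests(text):
    """Return one dict of captured string fields per t-test report found in text."""
    return [m.groupdict() for m in T_TEST.finditer(text)]


if __name__ == "__main__":
    sample = "The effect was significant, t(24) = 2.31, p = .028."
    print(extract_t_tests(sample))
```

A real rule set would need many such patterns to cover other statistic types and non-APA notations, which is why reducing redundant rules matters.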