The number of scientific publications released each day is too large for every paper to be reviewed manually, so automatic extraction of key information is desirable. In this paper, we examine STEREO, a tool for extracting statistics from scientific papers using regular expressions. By adapting an existing regular expression inclusion algorithm to our use case, we reduce the number of regular expressions used in STEREO by about $33.8\%$. We identify common patterns in the condensed rule set that can guide the creation of new rules. We also apply STEREO, which was previously trained on the life-sciences and medical domain, to a new scientific domain, namely Human-Computer Interaction (HCI), and re-evaluate it. Our results indicate that statistics in the HCI domain are similar to those in the medical domain, although a higher percentage of APA-conformant statistics was found in the HCI domain. Additionally, we compare extraction from PDF and LaTeX source files and find LaTeX to be the more reliable source for extraction.
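To illustrate what regex-based extraction of an APA-style statistic can look like, the following is a minimal sketch in Python. It is not one of STEREO's actual rules: the pattern `APA_T_TEST`, the function `extract_t_tests`, and the output field names are all hypothetical and chosen only for illustration. The sketch assumes a Student's t-test reported in the form `t(df) = value, p < .05`.

```python
import re

# Hypothetical rule (not taken from STEREO): matches an APA-style
# Student's t-test such as "t(24) = 2.31, p < .05".
APA_T_TEST = re.compile(
    r"\bt\s*\(\s*(?P<df>\d+(?:\.\d+)?)\s*\)\s*=\s*"  # t(df) =
    r"(?P<value>-?\d*\.?\d+)\s*,\s*"                 # test statistic
    r"p\s*(?P<op>[<>=])\s*(?P<p>0?\.\d+)"            # p-value with comparator
)


def extract_t_tests(text):
    """Return one dict per t-test report found in `text`."""
    return [
        {
            "df": float(m["df"]),
            "t": float(m["value"]),
            "op": m["op"],
            "p": float(m["p"]),
        }
        for m in APA_T_TEST.finditer(text)
    ]


sample = (
    "The effect was significant, t(24) = 2.31, p < .05, "
    "but the follow-up was not, t(10.5) = -0.87, p = .402."
)
results = extract_t_tests(sample)
for r in results:
    print(r)
```

A real rule set needs many such patterns (one per statistic type and reporting variant), which is why reducing redundant rules through regular expression inclusion is useful.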