Scholarly Knowledge Extraction from Published Software Packages

Many scientific software packages are published in repositories such as Zenodo and figshare. These packages are crucial for the reproducibility of published research. As an additional route to scholarly knowledge graph construction, we propose an approach for the automated extraction of machine-actionable (structured) scholarly knowledge from published software packages by static analysis of their (meta)data and contents (in particular, scripts in languages such as Python). The approach can be summarized as follows. First, we extract metadata (software description, programming languages, related references) from software packages by leveraging the Software Metadata Extraction Framework (SOMEF) and the GitHub API. Second, we analyze the extracted metadata to find the research articles associated with the corresponding software repository. Third, for software contained in published packages, we create and analyze the Abstract Syntax Tree (AST) representation to extract information about the procedures performed on data. Fourth, we search for the extracted information in the full text of the related articles to constrain it to scholarly knowledge, i.e., information published in the scholarly literature. Finally, we publish the extracted machine-actionable scholarly knowledge in the Open Research Knowledge Graph (ORKG).
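
The third and fourth steps can be illustrated with a minimal sketch, assuming Python scripts and the standard-library `ast` module. It is not the authors' implementation: the helper names `extract_calls` and `filter_by_article`, the sample script, the sample article sentence, and the naive substring matching rule are all hypothetical and chosen only for illustration.

```python
import ast


def extract_calls(source: str) -> set[str]:
    """Statically collect the names of functions called in a script.

    Import aliases (e.g. `pd` for `pandas`) are resolved to full module
    paths so that the extracted names are comparable across scripts.
    """
    tree = ast.parse(source)

    # Map local names introduced by imports to their fully qualified names.
    aliases: dict[str, str] = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                aliases[alias.asname or alias.name] = alias.name
        elif isinstance(node, ast.ImportFrom) and node.module:
            for alias in node.names:
                local = alias.asname or alias.name
                aliases[local] = f"{node.module}.{alias.name}"

    # Collect every call expression and rewrite its leading alias, if any.
    calls: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            name = ast.unparse(node.func)  # requires Python 3.9+
            head, _, rest = name.partition(".")
            if head in aliases:
                name = aliases[head] + ("." + rest if rest else "")
            calls.add(name)
    return calls


def filter_by_article(calls: set[str], article_text: str) -> set[str]:
    """Keep only calls whose function name occurs in the article text.

    Assumption: a plain case-insensitive substring match stands in for
    whatever matching strategy a real pipeline would use.
    """
    text = article_text.lower()
    return {c for c in calls if c.rsplit(".", 1)[-1].lower() in text}


# Hypothetical script taken from a published software package.
SCRIPT = '''
import pandas as pd
from scipy import stats

df = pd.read_csv("measurements.csv")
result = stats.ttest_ind(df["a"], df["b"])
print(result)
'''

# Hypothetical sentence from the full text of the related article.
ARTICLE = "The samples were loaded via read_csv and compared with ttest_ind."

calls = extract_calls(SCRIPT)
scholarly = filter_by_article(calls, ARTICLE)
```

Here `calls` contains `pandas.read_csv`, `scipy.stats.ttest_ind`, and `print`, while `scholarly` drops `print` because it is not mentioned in the article text. The script is only parsed, never executed, which is what makes the analysis static.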