Source code is essential for researchers to reproduce the methods and replicate the results of artificial intelligence (AI) papers. Some organizations and researchers manually collect AI papers with available source code to contribute to the AI community, but manual collection is labor-intensive and time-consuming. To address this issue, we propose a method to automatically identify papers with available source code and extract their source code repository URLs. With this method, we find that 20.5% of regular papers from 10 top AI conferences published between 2010 and 2019 have available source code, and that 8.1% of these source code repositories are no longer accessible. We also create the XMU NLP Lab README Dataset, the largest dataset of labeled README files for source code documentation research. Through this dataset, we find that quite a few README files provide no installation instructions or usage tutorials. Furthermore, we conduct a large-scale, comprehensive statistical analysis to give a general picture of the source code of AI conference papers. The proposed solution can also be applied beyond AI conference papers to scientific papers from other journals and conferences, shedding light on more domains.
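To make the two processing steps described above concrete, the following is a minimal sketch, not the paper's actual pipeline: it extracts candidate repository URLs from a paper's text with a simple pattern match and then probes each URL to see whether the repository is still accessible. The hosting domains, the regular expression, the function names `extract_repo_urls` and `is_accessible`, and the HEAD-request check are illustrative assumptions.

```python
import re
import urllib.request

# Assumed pattern for common code-hosting sites; the paper's real extraction
# method is not specified in this abstract.
REPO_URL_PATTERN = re.compile(
    r"https?://(?:www\.)?(?:github\.com|gitlab\.com|bitbucket\.org)/[\w.-]+/[\w.-]+",
    re.IGNORECASE,
)


def extract_repo_urls(paper_text: str) -> list[str]:
    """Return the candidate source code repository URLs found in a paper's text."""
    return sorted(set(REPO_URL_PATTERN.findall(paper_text)))


def is_accessible(url: str, timeout: float = 10.0) -> bool:
    """Check whether a repository URL still resolves (HTTP status below 400)."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except Exception:
        # Network errors and 4xx/5xx responses are treated as "no longer accessible".
        return False


if __name__ == "__main__":
    sample = "Our code is available at https://github.com/example/example-repo ."
    for url in extract_repo_urls(sample):
        status = "accessible" if is_accessible(url) else "no longer accessible"
        print(url, status)
```

A sketch like this could be run over the full text of each conference paper to flag papers with available source code and to measure how many of the extracted repositories have since become unreachable.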