Software is an important tool for scholarly work, but software produced for research is often not easily identifiable or discoverable. A potential first step in linking research and software is software identification. In this paper, we present two datasets for studying the identification and production of research software. The first dataset contains almost 1000 human-labeled annotations of software production from research projects funded by the National Science Foundation (NSF). We use this dataset to train models that predict software production. Our second dataset is created by applying the trained predictive models to the abstracts and project outcomes reports of all NSF-funded projects between 2010 and 2023. The result is an inferred dataset of software production for over 150,000 NSF awards. We release the Soft-Search dataset to aid in identifying and understanding research software production: https://github.com/si2-urssi/eager
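The two-stage workflow described above (train on a small human-labeled set, then infer labels for a much larger unlabeled corpus) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not name the model family, so a tiny bag-of-words naive Bayes classifier stands in for the trained models, and the example texts, award identifiers, and function names are hypothetical rather than taken from the Soft-Search data.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphabetic tokens only.
    return re.findall(r"[a-z]+", text.lower())


def train(examples):
    """Fit a naive Bayes model on (text, produced_software) pairs.

    Assumes both labels (True and False) occur at least once.
    """
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocab


def predict(model, text):
    """Return True if the text is predicted to describe software production."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in (True, False):
        total_words = sum(word_counts[label].values())
        # Log prior plus Laplace-smoothed log likelihood of each known token.
        score = math.log(doc_counts[label] / total_docs)
        for tok in tokenize(text):
            if tok in vocab:
                count = word_counts[label][tok] + 1
                score += math.log(count / (total_words + len(vocab)))
        scores[label] = score
    return scores[True] > scores[False]


# Stage 1: hypothetical stand-ins for the human-labeled annotations.
labeled = [
    ("We released an open source Python package on GitHub for data analysis.", True),
    ("The project developed a software toolkit and code repository.", True),
    ("We conducted field interviews and published survey results.", False),
    ("The workshop brought together researchers to discuss policy.", False),
]
model = train(labeled)

# Stage 2: apply the trained model to unlabeled award texts (hypothetical IDs).
awards = {
    "award-0001": "The team built open source software and shared code on GitHub.",
    "award-0002": "Researchers held a workshop and conducted interviews.",
}
inferred = {award_id: predict(model, text) for award_id, text in awards.items()}
```

In this sketch, `inferred` plays the role of the second dataset: one predicted software-production label per award, produced without further human annotation.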