Soft-Search: Two Datasets to Study the Identification and Production of Research Software

Software is an important tool for scholarly work, but software produced for research is often hard to identify or discover. A potential first step in linking research and software is software identification. In this paper we present two datasets to study the identification and production of research software. The first dataset contains almost 1,000 human-labeled annotations of software production from research projects funded by the National Science Foundation (NSF). We use this dataset to train models that predict software production. We create the second dataset by applying the trained predictive models to the abstracts and project outcomes reports of all NSF-funded projects between 2010 and 2023. The result is an inferred dataset of software production for over 150,000 NSF awards. We release the Soft-Search dataset to aid in identifying and understanding research software production: https://github.com/si2-urssi/eager
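The two-stage workflow described in the abstract (train on a small labeled set, then infer labels for a much larger unlabeled corpus) can be sketched as follows. This is only a minimal illustration under stated assumptions: the abstract does not say which models were trained, so a small multinomial naive Bayes classifier stands in here, and the function names (`tokenize`, `train`, `predict`) and all example texts are invented for this sketch and are not taken from the Soft-Search data or code.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(labeled_texts):
    """Fit a multinomial naive Bayes model from (text, label) pairs."""
    word_counts = {}        # label -> Counter of word frequencies
    doc_counts = Counter()  # label -> number of training texts
    for text, label in labeled_texts:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return {"word_counts": word_counts, "doc_counts": doc_counts, "vocab": vocab}


def predict(model, text):
    """Return the label with the highest log-posterior for the text."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n_docs in model["doc_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)  # log prior
        for word in tokenize(text):
            if word in model["vocab"]:  # skip words never seen in training
                # Laplace (add-one) smoothing avoids zero probabilities.
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Stage 1: a tiny hand-labeled set. These texts are invented for
# illustration; True means "the project produced software".
TRAINING = [
    ("We will release an open source software package and code repository", True),
    ("The project develops a python library and open source tools", True),
    ("This award supports a workshop and travel for graduate students", False),
    ("We study the migration patterns of birds using field surveys", False),
]
model = train(TRAINING)

# Stage 2: apply the trained model to unlabeled texts to infer labels.
UNLABELED = [
    "The team will develop an open source software library",
    "Funds support field surveys of birds",
]
inferred = [(text, predict(model, text)) for text in UNLABELED]
```

On these toy inputs the first unlabeled text is inferred to produce software and the second is not. A real pipeline would need far more labeled data and a held-out evaluation before its inferred labels could be trusted.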

