Almost every Mining Software Repositories (MSR) study requires, as a first step, the selection of the subject software repositories. These repositories are usually collected from hosting services like GitHub using specific selection criteria dictated by the study goal. For example, a study related to licensing might be interested in selecting projects that explicitly declare a license. Once the selection criteria have been defined, utilities such as the GitHub APIs can be used to "query" the hosting service. However, researchers have to deal with the usage limitations imposed by these APIs and with the absence of some required information. For example, the GitHub search APIs allow 30 requests per minute and, when searching repositories, provide only limited information (e.g., the number of commits in a repository is not included). To support researchers in sampling projects from GitHub, we present GHS (GitHub Search), a dataset containing 25 characteristics (e.g., number of commits, license, etc.) of 735,669 repositories written in 10 programming languages. The set of characteristics has been derived by looking for frequently used project selection criteria in MSR studies, and the dataset is continuously updated to (i) always provide fresh data about the existing projects, and (ii) increase the number of indexed projects. The GHS dataset can be queried through a web application we built, which allows users to combine the selection criteria needed for a study and download the information of the matching repositories: https://seart-ghs.si.usi.ch.
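As an illustration of what applying selection criteria to repository metadata can look like, the following is a minimal sketch in Python. It assumes the metadata has already been obtained as a list of records. The `Repo` class, its field names, and the `select_repos` function are hypothetical and chosen for illustration only; they are not the GHS schema or its interface.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Repo:
    # Hypothetical record; field names are illustrative, not the GHS schema.
    name: str
    language: str
    commits: int
    license: Optional[str]


def select_repos(repos: Iterable[Repo], language: str, min_commits: int,
                 require_license: bool = True) -> List[Repo]:
    """Keep only the repositories that satisfy every selection criterion."""
    selected = []
    for repo in repos:
        if repo.language != language:
            continue  # wrong programming language
        if repo.commits < min_commits:
            continue  # too few commits
        if require_license and not repo.license:
            continue  # no license declared
        selected.append(repo)
    return selected


if __name__ == "__main__":
    # Made-up sample records, for demonstration only.
    sample = [
        Repo("a/one", "Java", 1200, "MIT"),
        Repo("b/two", "Java", 40, "Apache-2.0"),
        Repo("c/three", "Python", 5000, None),
        Repo("d/four", "Java", 900, None),
    ]
    for repo in select_repos(sample, language="Java", min_commits=100):
        print(repo.name)
```

With the made-up sample above, only `a/one` satisfies all three criteria (Java, at least 100 commits, a declared license). A licensing study of the kind mentioned in the abstract would keep `require_license=True`, while a study that does not depend on licensing could set it to `False`.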