Software repository hosting services contain large amounts of open-source software; GitHub alone hosts more than 100 million repositories, ranging from new projects to established ones. Given this vast number of projects, there is a pressing need for search based on a software project's content and features. However, although GitHub offers various solutions to aid software discovery, most repositories do not have any labels, which reduces the utility of search and topic-based analysis. Moreover, classifying software modules is gaining importance with the rise of Component-Based Software Development. Previous work on software classification has relied on keyword-based approaches or on proxies for the project (e.g., the README), which are not always available. In this work, we create a new annotated dataset of GitHub Java projects called LabelGit. Our dataset uses information taken directly from the source code, such as the dependency graph and neural representations of the source code derived from its identifiers. With this dataset, we hope to aid the development of solutions that do not rely on proxies but instead use the entire source code to perform classification.