LabelGit: A Dataset for Software Repositories Classification using Attributed Dependency Graphs

Software repository hosting services contain large amounts of open-source software, with GitHub hosting more than 100 million repositories, from new to established ones. Given this vast number of projects, there is a pressing need for search based on a project's content and features. However, even though GitHub offers various solutions to aid software discovery, most repositories do not have any labels, which reduces the utility of search and topic-based analysis. Moreover, classifying software modules is becoming more important given the rise of Component-Based Software Development. Previous work on software classification, however, has relied on keyword-based approaches or on proxies for the project (e.g., the README), which are not always available. In this work, we create a new annotated dataset of GitHub Java projects called LabelGit. Our dataset uses information taken directly from the source code, such as the dependency graph and neural representations of the source code derived from its identifiers. With this dataset, we hope to aid the development of solutions that do not rely on proxies but instead use the entire source code to perform classification.

