GitHub is the world's largest host of source code, with more than 150M repositories. However, most of these repositories are unlabeled or inadequately labeled, making it harder for users to find relevant projects. Various approaches to classifying software application domains have been proposed over the past years, but they lack a well-defined taxonomy that is hierarchical, grounded in a knowledge base, and free of irrelevant terms. This work proposes GitRanking, a framework for creating a classification whose terms are ranked into discrete levels based on how general or specific their meaning is. We collected 121K topics from GitHub and considered $60\%$ of the most frequent ones for the ranking. GitRanking 1) uses active sampling to minimize the number of required annotations; and 2) links each topic to Wikidata, reducing ambiguities and improving the reusability of the taxonomy. Our results show that developers, when annotating their projects, avoid using terms with a high degree of specificity, which makes their projects harder for other users to find. Furthermore, we show that GitRanking can effectively rank terms according to how general or specific their meaning is. This ranking would be an essential asset for developers to build upon, allowing them to complement their annotations with more precise topics. Finally, we show that GitRanking is a dynamically extensible method: it can accept further terms to be ranked with a minimal number of annotations ($\sim 15$). This paper is the first collective attempt to build a ground-up taxonomy of software domains.
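To make the idea of ranking terms into discrete levels concrete, the following is a minimal sketch and not the GitRanking implementation. It assumes that annotations arrive as pairwise judgments of the form (more general term, more specific term), scores the terms with a simple Elo-style update, and then splits the ordered terms into equal-sized levels. The abstract does not say which rating model or active sampling strategy GitRanking uses, so the Elo update, the function names `rank_topics` and `to_levels`, the parameters `k`, `base`, and `n_levels`, and the example topics are all illustrative assumptions. Active sampling, which would choose the next pair to annotate, is left out.

```python
from collections import defaultdict


def rank_topics(comparisons, k=32.0, base=1000.0):
    """Elo-style scores from (more_general, more_specific) annotation pairs.

    Assumption: a higher score means a more general term.
    """
    scores = defaultdict(lambda: base)
    for general, specific in comparisons:
        # Probability, under the current scores, that `general` is judged more general.
        expected = 1.0 / (1.0 + 10 ** ((scores[specific] - scores[general]) / 400.0))
        delta = k * (1.0 - expected)
        scores[general] += delta
        scores[specific] -= delta
    return dict(scores)


def to_levels(scores, n_levels=3):
    """Split terms into n_levels equal-sized bins; level 0 holds the most general terms."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {
        term: min(i * n_levels // len(ordered), n_levels - 1)
        for i, term in enumerate(ordered)
    }


# Hypothetical annotations: the first term of each pair was judged more general.
annotations = [
    ("software", "machine-learning"),
    ("machine-learning", "pytorch"),
    ("software", "pytorch"),
]
scores = rank_topics(annotations)
levels = to_levels(scores)
```

In this toy input, `software` ends up in the most general level and `pytorch` in the most specific one. Under the same assumptions, a new term could be added by collecting a handful of comparisons against already-ranked terms and rerunning the update.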