Topic Recommendation for Software Repositories using Multi-label Classification Algorithms

Many platforms exploit collaborative tagging to provide their users with faster and more accurate results while searching or navigating. Tags can communicate different concepts such as the main features, technologies, functionality, and goal of a software repository. Recently, GitHub has enabled users to annotate repositories with topic tags. It has also provided a set of featured topics and their possible aliases, carefully curated with the help of the community. This creates the opportunity to use this initial seed of topics to automatically annotate all remaining repositories by training models that recommend high-quality topic tags to developers. In this work, we study the application of multi-label classification techniques to predict software repositories' topics. First, we map the large space of user-defined topics to those featured by GitHub. The core idea is to derive more information from projects' available documentation. Our data contains about $152$K GitHub repositories and $228$ featured topics. Then, we apply supervised models to repositories' textual information such as descriptions, README files, wiki pages, and file names. We assess the performance of our approach both quantitatively and qualitatively. Our proposed model achieves Recall@5 and label ranking average precision (LRAP) scores of $0.890$ and $0.805$, respectively. Moreover, based on users' assessment, our approach is highly capable of recommending a correct and complete set of topics. Finally, we use our models to develop an online tool named \texttt{Repository Catalogue}, which automatically predicts topics for GitHub repositories and is publicly available.
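
The abstract reports Recall@5 and LRAP without defining them. The following is a minimal sketch of how these two ranking metrics are commonly computed for multi-label predictions; it is not the authors' evaluation code. It assumes Recall@k is the per-repository fraction of true topics found among the k highest-scored topics, averaged over repositories, and that LRAP follows the usual label ranking average precision definition. The function names `recall_at_k` and `lrap` and the toy data are hypothetical and serve only as illustration.

```python
from typing import Sequence


def recall_at_k(y_true: Sequence[Sequence[int]],
                y_score: Sequence[Sequence[float]],
                k: int) -> float:
    """Mean over samples of (true labels in the top-k scores) / (number of true labels)."""
    total, counted = 0.0, 0
    for truth, scores in zip(y_true, y_score):
        relevant = {j for j, t in enumerate(truth) if t}
        if not relevant:
            continue  # assumption: samples without any true label are skipped
        # Indices of the k labels with the highest predicted scores.
        top_k = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
        total += len(relevant.intersection(top_k)) / len(relevant)
        counted += 1
    return total / counted if counted else 0.0


def lrap(y_true: Sequence[Sequence[int]],
         y_score: Sequence[Sequence[float]]) -> float:
    """Label ranking average precision, averaged over samples with at least one true label."""
    total, counted = 0.0, 0
    for truth, scores in zip(y_true, y_score):
        relevant = [j for j, t in enumerate(truth) if t]
        if not relevant:
            continue  # assumption: samples without any true label are skipped
        sample_sum = 0.0
        for j in relevant:
            # Rank of label j: how many labels score at least as high as it does.
            rank = sum(1 for s in scores if s >= scores[j])
            # How many of those labels are themselves true labels.
            hits = sum(1 for i in relevant if scores[i] >= scores[j])
            sample_sum += hits / rank
        total += sample_sum / len(relevant)
        counted += 1
    return total / counted if counted else 0.0


# Toy data (invented): two repositories, three candidate topics each.
toy_true = [[1, 0, 0], [0, 0, 1]]
toy_score = [[0.75, 0.5, 1.0], [1.0, 0.2, 0.1]]

toy_recall_at_2 = recall_at_k(toy_true, toy_score, k=2)
toy_lrap = lrap(toy_true, toy_score)
```

Under these definitions, a Recall@5 of $0.890$ would mean that, on average, about 89% of a repository's true topics appear among the five highest-scored recommendations, while LRAP rewards rankings that place true topics above incorrect ones.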
