Automatically Categorising GitHub Repositories by Application Domain

GitHub is the largest host of open-source software on the Internet. This large, freely accessible database has attracted the attention of practitioners and researchers alike. But as GitHub continues to grow, it is becoming increasingly hard to navigate the plethora of repositories, which span a wide range of domains. Past work has shown that taking the application domain into account is crucial for tasks such as predicting the popularity of a repository and reasoning about project quality. In this work, we build on a previously annotated dataset of 5,000 GitHub repositories to design an automated classifier for categorising repositories by their application domain. The classifier uses state-of-the-art natural language processing techniques and machine learning to learn from multiple data sources and catalogue repositories according to five application domains. Our contributions are (1) an automated classifier that can assign popular repositories to each application domain with at least 70% precision, (2) an investigation of the approach's performance on less popular repositories, and (3) a practical application of this approach to answer how the adoption of software engineering practices differs across application domains. Our work aims to help the GitHub community identify repositories of interest and opens promising avenues for future work investigating differences between repositories from different application domains.
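To make the "at least 70% precision" claim concrete, here is a minimal sketch of how per-domain precision could be computed and checked against that threshold. It is not the authors' evaluation code: the function name `per_domain_precision`, the domain labels `A`, `B`, `C`, and the sample predictions are all hypothetical placeholders, since the abstract does not name the five domains or give per-repository results.

```python
from collections import Counter


def per_domain_precision(true_labels, predicted_labels):
    """Return {domain: precision} for every domain predicted at least once.

    Precision for a domain = correct predictions of that domain
                             / all predictions of that domain.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    predicted_counts = Counter(predicted_labels)
    correct_counts = Counter(
        pred for true, pred in zip(true_labels, predicted_labels) if true == pred
    )
    return {
        domain: correct_counts[domain] / total
        for domain, total in predicted_counts.items()
    }


# Hypothetical placeholder data: domain names and labels are invented
# for illustration and do not come from the paper.
true_labels = ["A", "A", "B", "B", "C", "C", "C", "A"]
predicted_labels = ["A", "B", "B", "B", "C", "C", "A", "A"]

precision = per_domain_precision(true_labels, predicted_labels)

# Domains whose precision reaches the 70% threshold quoted in the abstract.
THRESHOLD = 0.70
meets_threshold = sorted(d for d, p in precision.items() if p >= THRESHOLD)
```

Precision is computed per domain, so a classifier can clear the threshold for one domain while falling short for another; a claim of "at least 70% for each domain" means the lowest per-domain value is the one that matters.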

