GitHub is the largest host of open-source software on the Internet. This large, freely accessible database has attracted the attention of practitioners and researchers alike. But as GitHub continues to grow, it is becoming increasingly hard to navigate the plethora of repositories, which span a wide range of domains. Past work has shown that taking the application domain into account is crucial for tasks such as predicting the popularity of a repository and reasoning about project quality. In this work, we build on a previously annotated dataset of 5,000 GitHub repositories to design an automated classifier that categorises repositories by their application domain. The classifier uses state-of-the-art natural language processing techniques and machine learning to learn from multiple data sources and catalogue repositories according to five application domains. We contribute (1) an automated classifier that can assign popular repositories to each application domain with at least 70% precision, (2) an investigation of the approach's performance on less popular repositories, and (3) a practical application of this approach to answer how the adoption of software engineering practices differs across application domains. Our work aims to help the GitHub community identify repositories of interest and opens promising avenues for future work investigating differences between repositories from different application domains.