Modern software development is based on a series of rapid, incremental changes made collaboratively to large source code repositories by developers with varying levels of experience and expertise. The ZeroIn project aims to analyze the metadata of these dynamic phenomena, including data on repositories, commits, and developers, to rapidly and accurately mark the quality of commits as they arrive at the repositories. In this context, this article characterizes software development metadata in terms of the data distributions that best capture the trends in the datasets. Multiple datasets are analyzed for this purpose, including Stack Overflow data on developers' features and GitHub data on over 452 million repositories with 16 million commits. This characterization is intended to make it possible to generate multiple synthetic datasets that can be used to train and test novel machine learning-based solutions for improving the reliability of software even as it evolves. It is also intended to help the development process exploit the latent correlations among many key feature vectors across the aggregate space of repositories and developers. The data characterization in this article is designed to feed into the machine learning components of ZeroIn, including the application of binary classifiers for early flagging of buggy software commits and the development of graph-based learning methods to exploit sparse connectivity among the sets of repositories, commits, and developers.