Software projects under version control grow with each commit and can accumulate hundreds of thousands of commits per repository. Especially for such large projects, traversing a repository and extracting data for static source code analysis pose a trade-off between granularity and speed. We present pyrepositoryminer, a command-line tool that combines a set of optimization approaches for efficient traversal of, and data extraction from, git repositories while remaining adaptable to third-party and custom software metrics and data extractions. Written in Python, the tool relies on bare repository access, in-memory storage, parallelization, caching, change-based analysis, and optimized communication between the traversal and the custom data extraction components. Metrics can be written in Python or supplied as external programs. A single-threaded performance evaluation based on a basic mining use case shows a mean speedup of 15.6x over other freely available tools across four mid-sized open source projects. Multi-threaded execution distributes the load across cores and raises the mean speedup to as much as 86.9x with 12 threads.
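To make two of the optimizations named in the abstract more concrete, the following is a minimal sketch of change-based analysis combined with caching. It does not use pyrepositoryminer's actual API or read a real git repository. The toy data (`BLOBS`, `COMMITS`) and the names `line_count`, `changed_paths`, and `mine` are hypothetical and serve only as illustration. The sketch assumes a linear history, a metric that depends only on file content, and ignores deleted files.

```python
from functools import lru_cache

# Hypothetical toy data: blob id -> file content (a stand-in for git blobs).
BLOBS = {
    "b1": "a\nb\n",
    "b2": "a\nb\nc\n",
    "b3": "x\n",
}

# Hypothetical linear history: each commit maps file path -> blob id.
COMMITS = [
    ("c1", {"main.py": "b1"}),
    ("c2", {"main.py": "b1", "util.py": "b3"}),
    ("c3", {"main.py": "b2", "util.py": "b3"}),
]

# Records every blob the metric was actually computed for.
calls = []


@lru_cache(maxsize=None)
def line_count(blob_id):
    """Example metric, cached per blob id so identical content is measured once."""
    calls.append(blob_id)
    return len(BLOBS[blob_id].splitlines())


def changed_paths(parent_tree, tree):
    """Return paths that were added or modified relative to the parent commit."""
    return [path for path, blob in tree.items() if parent_tree.get(path) != blob]


def mine(commits):
    """Compute the metric only for files that changed in each commit."""
    results = {}
    parent_tree = {}
    for commit_id, tree in commits:
        results[commit_id] = {
            path: line_count(tree[path])
            for path in changed_paths(parent_tree, tree)
        }
        parent_tree = tree
    return results
```

Calling `mine(COMMITS)` measures each distinct blob once, although the three commits together contain five file entries. In this sketch the saving comes from skipping unchanged files and from the cache keyed by blob id. Parallelization, which the abstract also lists, is left out to keep the example short.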