Towards a Change Taxonomy for Machine Learning Systems

Machine Learning (ML) research publications commonly provide open-source implementations on GitHub, allowing their audience to replicate, validate, or even extend machine learning algorithms, data sets, and metadata. However, little is known so far about the degree of collaboration activity on such ML research repositories, in particular regarding (1) the degree to which such repositories receive contributions from forks, (2) the nature of such contributions (i.e., the types of changes), and (3) the nature of changes that are not contributed back by forks, which might represent missed opportunities. In this paper, we empirically study contributions to 1,346 ML research repositories and their 67,369 forks, both quantitatively and qualitatively (by building on Hindle et al.'s seminal taxonomy of code changes). We found that while ML research repositories are heavily forked, only 9% of the forks made modifications to the forked repository. Of the latter, 42% sent changes to the parent repositories, half of which (52%) were accepted by the parent repositories. Our qualitative analysis of 539 contributed and 378 local (fork-only) changes extends Hindle et al.'s taxonomy with one new top-level change category related to ML (Data) and 15 new sub-categories, including nine ML-specific ones (input data, output data, program data, sharing, change evaluation, parameter tuning, performance, pre-processing, model training). While the changes that are not contributed back by the forks mostly concern domain-specific customizations and local experimentation (e.g., parameter tuning), the original ML repositories do miss out on a non-negligible 15.4% of Documentation changes, 13.6% of Feature changes, and 11.4% of Bug fix changes. The findings in this paper will be useful for practitioners, researchers, toolsmiths, and educators.
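
The percentages in the abstract are nested: each one applies to the group named just before it. The sketch below chains them to show roughly how many forks that implies at each stage. It is only illustrative arithmetic. The function name `fork_funnel` is hypothetical, the percentages are the rounded values quoted above, and it assumes the 52% acceptance figure can be applied to the forks that sent changes, so the resulting counts are approximations and not figures reported by the study.

```python
def fork_funnel(total_forks, modified_rate, sent_rate, accepted_rate):
    """Chain the nested rates into approximate fork counts per stage."""
    modified = total_forks * modified_rate  # forks that modified the forked code
    sent = modified * sent_rate             # of those, forks that sent changes to the parent
    accepted = sent * accepted_rate         # assumption: acceptance rate applied per fork
    return {
        "forks": total_forks,
        "modified": modified,
        "sent": sent,
        "accepted": accepted,
    }


# Inputs are the rounded figures quoted in the abstract.
funnel = fork_funnel(67_369, 0.09, 0.42, 0.52)

for stage, count in funnel.items():
    share = count / funnel["forks"]
    print(f"{stage:>9}: ~{count:,.0f} forks ({share:.1%} of all forks)")
```

Under these assumptions, roughly 4% of all forks sent changes back to a parent repository and roughly 2% had changes accepted, which puts the "heavily forked" observation in perspective.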
