Machine learning has proved to be a useful tool for extracting knowledge from scientific data in numerous research fields, including astrophysics, genomics, and molecular dynamics. Data sets from these fields are often so large that they must be processed on distributed platforms, and several distributed machine learning libraries exist for this purpose. One of them is dislib, a distributed machine learning library for Python designed to process large-scale data sets on HPC clusters, which makes it a strong candidate for analyzing scientific data. However, dislib's main distributed data structure, called Dataset, has some limitations, including poor performance in certain operations and low flexibility and usability. In this paper, we propose a novel distributed data structure for dislib, called ds-array, that addresses dislib's main limitations in data management. Ds-arrays simplify distributed data management in dislib by exposing a NumPy-like API, provide more flexibility, and reduce the computational complexity of some operations. This results in performance improvements of up to two orders of magnitude over Datasets, while also greatly improving scalability and usability.
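To make the idea of a block-partitioned array with a NumPy-like interface more concrete, the following is a minimal single-process sketch. It is not dislib's implementation or API: the class `BlockedArray` and its methods `from_numpy`, `transpose`, `sum_axis0`, and `collect` are hypothetical names chosen for illustration, and the blocks are plain in-memory NumPy arrays rather than distributed objects. The sketch only shows how some operations can be expressed block by block, so that the full matrix is assembled only when explicitly requested.

```python
import numpy as np


class BlockedArray:
    """Toy 2-D array stored as a grid of NumPy blocks.

    Hypothetical illustration only; this is not dislib's ds-array.
    """

    def __init__(self, blocks, block_size, shape):
        self.blocks = blocks          # list of rows, each a list of np.ndarray
        self.block_size = block_size  # (rows per block, columns per block)
        self.shape = shape            # shape of the full logical array

    @classmethod
    def from_numpy(cls, x, block_size):
        # Cut the matrix into a regular grid; edge blocks may be smaller.
        br, bc = block_size
        blocks = [
            [x[i:i + br, j:j + bc] for j in range(0, x.shape[1], bc)]
            for i in range(0, x.shape[0], br)
        ]
        return cls(blocks, block_size, x.shape)

    def transpose(self):
        # Transpose each block and swap its grid position.
        n_rows, n_cols = len(self.blocks), len(self.blocks[0])
        t = [
            [self.blocks[i][j].T for i in range(n_rows)]
            for j in range(n_cols)
        ]
        return BlockedArray(t, self.block_size[::-1], self.shape[::-1])

    def sum_axis0(self):
        # Column sums: add per-block partial sums down each block column.
        parts = []
        for j in range(len(self.blocks[0])):
            parts.append(sum(row[j].sum(axis=0) for row in self.blocks))
        return np.concatenate(parts)

    def collect(self):
        # Assemble the full matrix from its blocks.
        return np.block(self.blocks)


x = np.arange(35, dtype=float).reshape(5, 7)
a = BlockedArray.from_numpy(x, block_size=(2, 3))
```

In this toy version, `transpose` and `sum_axis0` only touch one block at a time; in a distributed setting, each per-block step could in principle be scheduled as an independent task, which is the kind of structure a block-based design makes possible.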