ds-array: A Distributed Data Structure for Large Scale Machine Learning

Machine learning has proved to be a useful tool for extracting knowledge from scientific data in numerous research fields, including astrophysics, genomics, and molecular dynamics. Often, data sets from these research areas need to be processed in distributed platforms due to their magnitude. This can be done using one of the various distributed machine learning libraries available. One of these libraries is dislib, a distributed machine learning library for Python especially designed to process large scale data sets on HPC clusters, which makes dislib an ideal candidate for analyzing scientific data. However, dislib's main distributed data structure, called Dataset, has some limitations, including poor performance in certain operations and low flexibility and usability. In this paper, we propose a novel distributed data structure for dislib, called ds-array, that addresses dislib's main limitations in data management. Ds-arrays simplify distributed data management in dislib by exposing a NumPy-like API, provide more flexibility, and reduce the computational complexity of some operations. This results in performance improvements of up to two orders of magnitude over Datasets, while also greatly improving scalability and usability.
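To make the idea of a NumPy-like interface over partitioned data more concrete, here is a minimal single-process sketch. It is not dislib's implementation or API: the class `BlockedArray`, its methods, and the choice to partition a 2-D array into a regular grid of blocks are all assumptions made for illustration. It only shows how whole-array operations can be expressed as independent per-block operations, which is the kind of work a distributed runtime could schedule in parallel.

```python
import numpy as np


class BlockedArray:
    """Toy 2-D array stored as a grid of NumPy blocks.

    Hypothetical illustration only; this is not dislib's ds-array API.
    """

    def __init__(self, data, block_shape):
        data = np.asarray(data)
        if data.ndim != 2:
            raise ValueError("only 2-D data is supported in this sketch")
        self.shape = data.shape
        self.block_shape = block_shape
        br, bc = block_shape
        # Grid of blocks: a list of block rows. Edge blocks may be smaller.
        self.blocks = [
            [data[i:i + br, j:j + bc].copy()
             for j in range(0, data.shape[1], bc)]
            for i in range(0, data.shape[0], br)
        ]

    def transpose(self):
        """Transpose block by block, without assembling the full array."""
        out = BlockedArray.__new__(BlockedArray)
        out.shape = (self.shape[1], self.shape[0])
        out.block_shape = (self.block_shape[1], self.block_shape[0])
        n_rows, n_cols = len(self.blocks), len(self.blocks[0])
        # Block (i, j) of the input becomes block (j, i) of the output.
        out.blocks = [
            [self.blocks[i][j].T for i in range(n_rows)]
            for j in range(n_cols)
        ]
        return out

    def column_means(self):
        """Column means computed from per-block partial sums."""
        n_rows, n_cols = len(self.blocks), len(self.blocks[0])
        partial = [
            sum(self.blocks[i][j].sum(axis=0) for i in range(n_rows))
            for j in range(n_cols)
        ]
        return np.concatenate(partial) / self.shape[0]

    def collect(self):
        """Assemble all blocks into one in-memory NumPy array."""
        return np.block(self.blocks)


x = np.arange(35, dtype=float).reshape(5, 7)
a = BlockedArray(x, block_shape=(2, 3))
print(a.shape, len(a.blocks), len(a.blocks[0]))
print(np.array_equal(a.transpose().collect(), x.T))
```

In this sketch each block is touched independently in `transpose` and `column_means`, so no step needs the full array in one place until `collect` is called.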

