Data engineering is becoming an increasingly important part of scientific discovery with the adoption of deep learning and machine learning. Data engineering deals with a variety of data formats, storage systems, data extraction, transformation, and data movement. One goal of data engineering is to transform data from its original form into the vector/matrix/tensor formats accepted by deep learning and machine learning applications. Many structures, such as tables, graphs, and trees, are used to represent data in these data engineering phases. Among them, tables are a versatile and commonly used format for loading and processing data. In this paper, we present a distributed Python API based on a table abstraction for representing and processing data. Unlike existing state-of-the-art data engineering tools written purely in Python, our solution adopts high-performance compute kernels written in C++ together with an in-memory table representation, exposed to Python through Cython-based bindings. In the core system, we use MPI for distributed-memory computation, following a data-parallel approach to process large datasets on HPC clusters.
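As a rough illustration of the data-parallel, MPI-based execution model described above, the following sketch shows how each worker can hold one partition of a table, apply local transformations, and combine results with a collective reduction. It uses mpi4py and pandas rather than the API presented in this paper, and the file layout and column names are hypothetical.

```python
# Minimal sketch (not the paper's actual API) of data-parallel table processing:
# each MPI rank owns one table partition; global results are combined with MPI.
from mpi4py import MPI
import pandas as pd

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each rank reads its own partition of the dataset (hypothetical file layout).
partition = pd.read_csv(f"data/part_{rank:04d}.csv")

# Local, embarrassingly parallel transformation on the table partition.
partition = partition[partition["value"] > 0.0]
local_sum = partition["value"].sum()

# Distributed-memory reduction across all ranks.
global_sum = comm.allreduce(local_sum, op=MPI.SUM)

if rank == 0:
    print(f"global sum over {size} partitions: {global_sum}")
```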