Data engineering is becoming an increasingly important part of scientific discovery with the adoption of deep learning and machine learning. Data engineering deals with a variety of data formats, storage systems, data extraction, transformation, and data movement. One goal of data engineering is to transform data from its original form into the vector/matrix/tensor formats accepted by deep learning and machine learning applications. Many structures, such as tables, graphs, and trees, are used to represent data in these data engineering phases. Among them, tables are a versatile and commonly used format for loading and processing data. In this paper, we present a distributed Python API based on a table abstraction for representing and processing data. Unlike existing state-of-the-art data engineering tools written purely in Python, our solution adopts high-performance compute kernels written in C++ together with an in-memory table representation, exposed to Python through Cython-based bindings. In the core system, we use MPI for distributed-memory computation, following a data-parallel approach to process large datasets on HPC clusters.
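As a rough illustration of the data-parallel, MPI-based execution model described above, the following sketch shows how each worker can hold one partition of a table, apply local transformations, and combine results with a collective reduction. It uses mpi4py and pandas rather than the API presented in this paper, and the file layout and column names are hypothetical.

```python
# Minimal sketch (not the paper's actual API) of data-parallel table processing:
# each MPI rank owns one table partition; global results are combined with MPI.
from mpi4py import MPI
import pandas as pd

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each rank reads its own partition of the dataset (hypothetical file layout).
partition = pd.read_csv(f"data/part_{rank:04d}.csv")

# Local, embarrassingly parallel transformation on the table partition.
partition = partition[partition["value"] > 0.0]
local_sum = partition["value"].sum()

# Distributed-memory reduction across all ranks.
global_sum = comm.allreduce(local_sum, op=MPI.SUM)

if rank == 0:
    print(f"global sum over {size} partitions: {global_sum}")
```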