Data-intensive applications impact many domains, and their steadily increasing size and complexity demand high-performance, highly usable environments. We integrate a set of ideas developed in various data science and data engineering frameworks. They employ a set of operators on specific data abstractions, including vectors, matrices, tensors, graphs, and tables. Our key concepts are inspired by systems such as MPI, HPF (High Performance Fortran), NumPy, Pandas, Spark, Modin, PyTorch, TensorFlow, RAPIDS (NVIDIA), and OneAPI (Intel). Further, it is crucial to support the different languages in everyday use in the Big Data arena, including Python, R, C++, and Java. We note the importance of Apache Arrow and Parquet for enabling language-agnostic high performance and interoperability. In this paper, we propose High-Performance Tensors, Matrices and Tables (HPTMT), an operator-based architecture for data-intensive applications, and identify the fundamental principles needed for performance and usability success. We illustrate these principles with a discussion of examples built on our software environments, Cylon and Twister2, which embody HPTMT.
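To make the operator-based idea concrete, the sketch below shows a minimal, single-process version of the pattern described above: columnar data held in Apache Arrow, persisted as Parquet for language-agnostic exchange, and manipulated through a table operator (a join). The file names, column names, and use of Arrow's built-in join are illustrative assumptions rather than code from the paper; in the HPTMT setting an engine such as Cylon would execute the same operator in a distributed fashion.

    # Minimal local sketch of the operator pattern on Arrow tables (assumed example).
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Two small in-memory tables; in practice these could be produced by
    # C++, Java, or R programs and exchanged without conversion cost.
    left = pa.table({"id": [1, 2, 3], "price": [10.0, 20.0, 30.0]})
    right = pa.table({"id": [2, 3, 4], "qty": [5, 7, 9]})

    # Persist and reload through Parquet to show the interoperable storage path.
    pq.write_table(left, "left.parquet")
    pq.write_table(right, "right.parquet")
    left2 = pq.read_table("left.parquet")
    right2 = pq.read_table("right.parquet")

    # Apply a relational operator (inner join). Arrow keeps the data columnar;
    # a distributed engine such as Cylon would run this operator across workers.
    joined = left2.join(right2, keys="id", join_type="inner")
    print(joined)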