The use of large-scale machine learning methods is becoming ubiquitous in applications ranging from business intelligence to self-driving cars. These methods require a complex computation pipeline consisting of various types of operations, e.g., relational operations for pre-processing or post-processing the dataset, and matrix operations for core model computations. Many existing systems focus on efficiently processing matrix-only operations, and assume that the inputs to the relational operators are pre-computed and materialized as intermediate matrices. However, in machine learning pipelines the input to a relational operator may be complex and may involve various combinations of matrix operators. Hence, it is critical to realize scalable and efficient relational query processors that operate directly on big matrix data. This paper presents new efficient and scalable relational query processing techniques on big matrix data for in-memory distributed clusters. The proposed techniques leverage algebraic transformation rules to rewrite query execution plans into ones with lower computation costs. A distributed query plan optimizer exploits the sparsity-inducing property of merge functions as well as Bloom join strategies for efficiently evaluating various flavors of the join operation. Furthermore, optimized partitioning schemes for the input matrices are developed to improve the performance of join operations, based on a cost model that minimizes the communication overhead. The proposed relational query processing techniques are prototyped in Apache Spark. Experiments on both real and synthetic data demonstrate that the proposed techniques achieve up to two orders of magnitude performance improvement over state-of-the-art systems on a wide range of applications.
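To make the Bloom join idea concrete, the following is a minimal single-process sketch in Python. It is not the implementation prototyped in Apache Spark. It assumes sparse matrices stored as `{(row, col): value}` dictionaries and a join on the row index. The names `BloomFilter` and `bloom_join`, and the parameters `num_bits` and `num_hashes`, are hypothetical and chosen only for illustration. The sketch builds a Bloom filter over the join keys of the smaller matrix and uses it to discard entries of the larger matrix that cannot match. In a distributed setting, that pre-filtering step is what would reduce the data sent over the network.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter (illustrative; sizes are arbitrary assumptions)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(key))


def bloom_join(small, large, num_bits=1024, num_hashes=3):
    """Join two sparse matrices {(row, col): value} on the row index.

    Returns the sorted join result and the number of entries of `large`
    that survived the Bloom pre-filter.
    """
    bloom = BloomFilter(num_bits, num_hashes)
    small_by_row = {}
    for (row, col), val in small.items():
        bloom.add(row)
        small_by_row.setdefault(row, []).append((col, val))

    # Pre-filter: keep only entries of `large` whose row might match.
    candidates = [(rc, v) for rc, v in large.items()
                  if bloom.might_contain(rc[0])]

    # Exact join on the survivors removes any Bloom false positives.
    result = []
    for (row, col_l), val_l in candidates:
        for col_s, val_s in small_by_row.get(row, []):
            result.append((row, col_s, val_s, col_l, val_l))
    return sorted(result), len(candidates)


small = {(0, 0): 1.0, (2, 1): 5.0}
large = {(0, 3): 2.0, (1, 1): 4.0, (2, 2): 3.0, (7, 0): 9.0}
joined, num_candidates = bloom_join(small, large)
```

Because a Bloom filter can return false positives but never false negatives, the exact lookup after the filter keeps the result correct. The filter only affects how many entries of the larger input must be examined.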