Matrix computations of growing size and complexity are widely used in scientific computing and engineering. However, current matrix language implementations lack the programmer support needed to use cloud computing resources effectively and seamlessly. We extend the Julia high-performance computing language to automatically parallelize matrix computations for the cloud. A novel matrix data type with lazy evaluation semantics shields users from the complexity of explicitly parallel computations. Delayed evaluation aggregates operations into expression trees, which are rewritten on the fly to eliminate common subexpressions and to apply optimizations such as exponentiation by squaring on matching subtrees. Trees are lowered into directed acyclic graphs (DAGs), for which dynamic simulation selects the optimal tile size and execution schedule for a given cluster of cloud nodes. We employ offline profiling to construct a time model for the compute and network capacity of the cluster. The experimental evaluation of our framework comprises eleven benchmarks on a cluster of eight nodes (288 vCPUs) in the AWS public cloud and reveals speedups of up to 4.11x, with an average of 78.36% of the theoretically possible maximum speedup.
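
To make the lazy-evaluation idea concrete, the following is a minimal sketch of how delayed matrix operations could be collected into an expression tree, rewritten, and then evaluated. It is not the framework's implementation: it is written in Python rather than Julia so that it stays self-contained, and every name in it (`Expr`, `lazy`, `rewrite`, `evaluate`, `power_by_squaring`) is hypothetical. It assumes small dense matrices stored as nested tuples and covers only two of the optimizations mentioned above: repeated products of the same subtree are folded into a power that is evaluated by squaring, and structurally identical subtrees are computed once through a cache.

```python
from dataclasses import dataclass


def matmul(a, b):
    """Dense matrix product on nested tuples."""
    return tuple(
        tuple(sum(x * y for x, y in zip(row, col)) for col in zip(*b))
        for row in a
    )


def matadd(a, b):
    """Element-wise matrix sum on nested tuples."""
    return tuple(tuple(x + y for x, y in zip(ra, rb)) for ra, rb in zip(a, b))


@dataclass(frozen=True)
class Expr:
    """Node of a lazy expression tree (hypothetical type, for illustration).

    Frozen dataclasses compare and hash structurally, so two identical
    subtrees are equal and can share a single cache entry.
    """
    op: str                 # "leaf", "add", "mul" or "pow"
    args: tuple = ()        # child expressions
    value: object = None    # matrix data for a leaf, exponent for "pow"

    def __add__(self, other):
        return Expr("add", (self, other))

    def __matmul__(self, other):
        return Expr("mul", (self, other))


def lazy(data):
    """Wrap concrete matrix data in a leaf node; nothing is computed yet."""
    return Expr("leaf", (), data)


def as_power(e):
    """View any expression as (base, exponent)."""
    return (e.args[0], e.value) if e.op == "pow" else (e, 1)


def rewrite(e):
    """Bottom-up rewrite: a product of powers of the same subtree becomes one power."""
    if e.op == "leaf":
        return e
    args = tuple(rewrite(a) for a in e.args)
    e = Expr(e.op, args, e.value)
    if e.op == "mul":
        (base_l, n_l), (base_r, n_r) = as_power(args[0]), as_power(args[1])
        if base_l == base_r:
            return Expr("pow", (base_l,), n_l + n_r)
    return e


def power_by_squaring(base, n, stats):
    """Compute base**n for n >= 1 with O(log n) matrix products."""
    result = None
    while n > 0:
        if n & 1:
            if result is None:
                result = base
            else:
                result = matmul(result, base)
                stats["matmul"] += 1
        n >>= 1
        if n:
            base = matmul(base, base)
            stats["matmul"] += 1
    return result


def evaluate(e, cache, stats):
    """Evaluate a tree; the cache ensures each distinct subtree is computed once."""
    if e in cache:
        return cache[e]
    if e.op == "leaf":
        result = e.value
    elif e.op == "add":
        result = matadd(evaluate(e.args[0], cache, stats),
                        evaluate(e.args[1], cache, stats))
    elif e.op == "mul":
        left = evaluate(e.args[0], cache, stats)
        right = evaluate(e.args[1], cache, stats)
        result = matmul(left, right)
        stats["matmul"] += 1
    else:  # "pow"
        result = power_by_squaring(evaluate(e.args[0], cache, stats), e.value, stats)
    cache[e] = result
    return result


# Usage: A^4 + A^4, written naively as six matrix products.
A = lazy(((1, 1), (0, 1)))
expr = (A @ A @ A @ A) + (A @ A @ A @ A)

stats = {"matmul": 0}
result = evaluate(rewrite(expr), {}, stats)
print(result)           # ((2, 8), (0, 2))
print(stats["matmul"])  # 2 products instead of 6
```

In this sketch both operands of the sum are rewritten to the same `pow` node, so the cache evaluates it once and squaring needs only two products. Tiling, scheduling, and distribution across cloud nodes are deliberately left out.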