The extensive use of HPC infrastructures and frameworks for running data-intensive applications has led to a growing interest in data partitioning techniques and strategies. In fact, finding an effective partitioning, i.e., a suitable size for data blocks, is a key strategy to speed up parallel data-intensive applications and increase scalability. This paper describes a methodology, based on supervised machine learning techniques, for data block size estimation in HPC applications. The implementation of the proposed methodology was evaluated using as a testbed dislib, a distributed computing library focused on machine learning algorithms and built on top of the PyCOMPSs framework. We assessed the effectiveness of our solution through an extensive experimental evaluation considering different algorithms, datasets, and infrastructures, including the MareNostrum 4 supercomputer. The results show that the methodology is able to efficiently determine a suitable way to split a given dataset, thus enabling the efficient execution of data-parallel applications in high performance environments.