Distributed data storage services tailored to specific applications have grown popular in the high-performance computing (HPC) community as a way to address I/O and storage challenges. These services offer a variety of specific interfaces, semantics, and data representations. They also expose many tuning parameters, making it difficult for their users to find the best configuration for a given workload and platform. To address this issue, we develop a novel variational-autoencoder-guided asynchronous Bayesian optimization method to tune HPC storage service parameters. Our approach uses transfer learning to leverage prior tuning results and uses a dynamically updated surrogate model to explore the large parameter search space in a systematic way. We implement our approach within the DeepHyper open-source framework, and apply it to the autotuning of a high-energy physics workflow on Argonne's Theta supercomputer. We show that our transfer-learning approach enables a more than $40\times$ search speedup over random search, compared with a $2.5\times$ to $10\times$ speedup when not using transfer learning. Additionally, we show that our approach is on par with state-of-the-art autotuning frameworks in speed and outperforms them in resource utilization and parallelization capabilities.