Distributed file systems are widely used, yet their default configurations are often suboptimal. Tuning configuration parameters, however, is typically challenging and time-consuming: it demands expertise, and the tuning operations themselves can be expensive. This is especially true for static parameters, where a change takes effect only after the system or the workload is restarted. We propose Magpie, a novel approach that uses deep reinforcement learning to tune static parameters by strategically exploring and exploiting the configuration parameter space. To speed up tuning, Magpie uses both server and client metrics of the distributed file system to learn the relationship between static parameters and performance. Our empirical evaluation shows that Magpie noticeably improves the performance of the distributed file system Lustre: when tuning towards a single performance indicator, it achieves an average throughput gain of 91.8% over the default configuration, and 39.7% more throughput gain than the baseline.
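The explore/exploit loop described above can be illustrated with a minimal sketch. It is not Magpie's implementation: it swaps deep reinforcement learning for a simple tabular epsilon-greedy search, and it replaces the real "apply configuration, restart, measure" step with a synthetic throughput function. The parameter names, candidate values, and the function `measure_throughput` are all hypothetical and chosen only for illustration.

```python
import itertools
import random

# Hypothetical static parameters and candidate values (illustrative only,
# not the parameters tuned in the paper).
PARAM_SPACE = {
    "stripe_count": [1, 2, 4, 8],
    "rpcs_in_flight": [8, 16, 32],
}


def measure_throughput(config, rng):
    """Stand-in for: apply config, restart the workload, read a throughput metric.

    Assumption: a synthetic response surface that peaks at
    stripe_count=4, rpcs_in_flight=16, plus small measurement noise.
    """
    base = (100.0
            - 10.0 * abs(config["stripe_count"] - 4)
            - abs(config["rpcs_in_flight"] - 16))
    return base + rng.uniform(-1.0, 1.0)


def tune(episodes=200, epsilon=0.2, seed=0):
    """Epsilon-greedy search over the discrete configuration space."""
    rng = random.Random(seed)
    names = list(PARAM_SPACE)
    configs = [dict(zip(names, values))
               for values in itertools.product(*PARAM_SPACE.values())]
    totals = [0.0] * len(configs)  # summed throughput per configuration
    counts = [0] * len(configs)    # number of measurements per configuration

    def mean(j):
        return totals[j] / counts[j]

    for _ in range(episodes):
        untried = [i for i, c in enumerate(counts) if c == 0]
        if untried:
            i = untried[0]                       # measure every config once
        elif rng.random() < epsilon:
            i = rng.randrange(len(configs))      # explore
        else:
            i = max(range(len(configs)), key=mean)  # exploit best so far
        totals[i] += measure_throughput(configs[i], rng)
        counts[i] += 1

    best = max(range(len(configs)), key=mean)
    return configs[best], mean(best)


if __name__ == "__main__":
    best_config, avg_throughput = tune()
    print(best_config, round(avg_throughput, 1))
```

Each iteration stands for one costly restart, which is why the number of episodes, rather than compute time, is the budget that matters for static parameters. A learned policy that also conditions on server and client metrics, as the abstract describes, would aim to reach a good configuration in fewer such restarts than this uninformed search.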