High Performance Computing (HPC) clusters often produce intermediate files as part of code execution, and message passing is not always a viable way to supply data to these cluster jobs. In these cases, I/O falls back to central distributed storage to allow cross-node data sharing. These systems deliver high performance but are characterised by a high cost per TB and a sensitivity to workload type, for example being tuned for either small- or large-file I/O. However, compute nodes often have large amounts of RAM, so for intermediate files, where longevity and reliability are less important, local RAM disks can be used to obtain performance benefits. In this paper we show how this problem was tackled by creating a RAM block device that can interact with the Ceph object storage system, together with a tool for deploying Ceph effectively on HPC infrastructure. This work resulted in a system that was more performant than the central high-performance distributed storage system used at Diamond, reducing I/O overhead and processing time for Savu, a tomography data processing application, by 81.04% and 8.32%, respectively.
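To make the core idea concrete before the implementation details that follow, the sketch below shows one generic way a RAM-backed block device can be created and handed to Ceph as an OSD. This is a minimal illustration, not the authors' actual tooling: the use of the Linux brd module, the ceph-volume deployment path, and the device name and size are all assumptions.

```python
import subprocess

def run(cmd):
    """Run a command, echoing it and raising on failure."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Assumption: the Linux brd module is available. Loading it with
# rd_nr=1 creates a single RAM-backed block device at /dev/ram0.
# rd_size is given in KiB, so 64 GiB = 64 * 1024 * 1024 KiB.
run(["modprobe", "brd", "rd_nr=1", "rd_size=67108864"])

# Hand the RAM block device to Ceph as an OSD via ceph-volume.
# The contents vanish on reboot, which is acceptable here because
# the device only holds short-lived intermediate files.
run(["ceph-volume", "lvm", "create", "--data", "/dev/ram0"])
```

Under these assumptions, each compute node contributes part of its RAM as a Ceph OSD, so intermediate files are shared across nodes without touching the central distributed storage.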