Today, deep learning is an essential technology in our lives. To solve more complex problems with deep learning, the sizes of both training datasets and neural networks are increasing. Training a model on such large datasets and networks requires distributed deep neural network (DDNN) training techniques. For large-scale DDNN training, HPC clusters are a promising computation environment. In large-scale DDNN training on HPC clusters, I/O performance is critical because it is becoming a bottleneck. Most flagship-class HPC clusters have hierarchical storage systems. To design future HPC storage systems, it is necessary to quantify how much a hierarchical storage system improves the performance of these workloads. This paper presents a quantitative performance analysis of the hierarchical storage system for a DDNN workload on a flagship-class supercomputer. Our analysis shows how much improvement in storage performance and how much additional storage capacity will be required to meet the performance goal.