Deep learning has been proposed as a solution to numerous problems across different branches of science. Given the resource-intensive nature of these models, they often need to be executed in a distributed manner on specialized hardware such as graphics processing units (GPUs). In academia, researchers gain access to such resources through High Performance Computing (HPC) clusters. These infrastructures complicate the training of such models due to their multi-user nature and the limited permissions granted to users. In addition, different HPC clusters may have different peculiarities that can hinder the research cycle (e.g., library dependencies). In this paper we develop a workflow and methodology for the distributed training of deep learning models on HPC clusters which provides researchers with a series of novel advantages. It relies on udocker as the containerization tool and on Horovod as the library for distributing the models across multiple GPUs. udocker does not require any special permissions, allowing researchers to run the entire workflow without depending on an administrator. Horovod ensures the efficient distribution of the training independently of the deep learning framework used. Additionally, thanks to containerization and specific features of the workflow, researchers are provided with a cluster-agnostic way of running their models. The experiments carried out show that the workflow offers good scalability in the distributed training of the models and that it adapts easily to different clusters.
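To illustrate the Horovod side of the approach, the following is a minimal sketch of a framework-level distributed training script; the TensorFlow/Keras API, the toy MNIST model, and all hyper-parameters are illustrative assumptions rather than the paper's actual setup.

    # Minimal Horovod sketch (assumed example, not the paper's configuration)
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()                                          # one process per GPU
    gpus = tf.config.list_physical_devices('GPU')
    if gpus:                                            # pin each process to its local GPU
        tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

    # Toy dataset and model, used only to keep the sketch self-contained
    (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
    dataset = tf.data.Dataset.from_tensor_slices(
        (x_train[..., None] / 255.0, y_train)).shuffle(10000).batch(64)

    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])

    # Scale the learning rate with the number of workers and wrap the optimizer
    # so gradients are averaged across processes
    opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(1e-3 * hvd.size()))
    model.compile(loss='sparse_categorical_crossentropy', optimizer=opt,
                  metrics=['accuracy'])

    callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]  # sync initial weights
    model.fit(dataset, epochs=2, callbacks=callbacks,
              verbose=1 if hvd.rank() == 0 else 0)      # log only on the root worker

Such a script would typically be launched with Horovod's launcher, e.g. horovodrun -np 4 python train_sketch.py, or through the cluster's MPI launcher; in the workflow described here it would run inside a udocker container, which requires no administrator privileges.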