As deep learning grows more expensive in both time and compute, inefficiencies in machine learning (ML) training put state-of-the-art models out of practical reach for most users. The newest model architectures are simply too large to fit on a single processor. To address this, many ML practitioners have turned to model parallelism, which distributes a model's computational requirements across several devices. Unfortunately, the sequential nature of neural networks leads to very low efficiency and device utilization in model-parallel training jobs. We propose a new form of "shard parallelism" that combines task and model parallelism, and package it into a framework we name Hydra. Hydra recasts model parallelism in the multi-model setting to produce a fine-grained parallel workload of independent model shards, rather than independent models. This new parallel design promises dramatic speedups relative to the traditional model parallelism paradigm.
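To make the shard-scheduling idea concrete, the toy sketch below (our own illustration under simplifying assumptions, not Hydra's implementation; names such as Shard and schedule_shards are hypothetical) treats each model as a queue of pipeline-ordered shards and greedily fills devices with whichever shards are ready, so shards from different models can occupy devices that would otherwise sit idle. Device placement, memory, and communication costs are abstracted away.

```python
# Illustrative sketch of shard parallelism as a scheduling problem:
# the unit of parallel work is a model shard, not a whole model.
from dataclasses import dataclass
from collections import deque

@dataclass
class Shard:
    model_id: int   # which model this shard belongs to
    stage: int      # position of the shard within its model's pipeline

def schedule_shards(num_models: int, shards_per_model: int, num_devices: int):
    """Greedy round-based schedule over independent shard tasks.

    Within a model, shard s+1 must wait for shard s (sequential dependency),
    but shards of *different* models are independent, so a device that would
    idle waiting on model A can run a ready shard of model B instead.
    """
    # One FIFO of pipeline stages per model; only each queue's head is runnable.
    queues = [deque(Shard(m, s) for s in range(shards_per_model))
              for m in range(num_models)]
    timeline = []  # (step, device, shard) assignments
    step = 0
    while any(queues):
        ready = [q.popleft() for q in queues if q]            # at most one runnable shard per model
        for device, shard in enumerate(ready[:num_devices]):  # fill as many devices as possible this step
            timeline.append((step, device, shard))
        for shard in ready[num_devices:]:                     # shards that did not fit wait for the next step
            queues[shard.model_id].appendleft(shard)
        step += 1
    return timeline

if __name__ == "__main__":
    for step, device, shard in schedule_shards(num_models=4, shards_per_model=3, num_devices=2):
        print(f"t={step}: device {device} runs model {shard.model_id}, shard {shard.stage}")
```

In this simplified view, plain model parallelism would leave all but one device idle while a single model's shards execute in sequence; interleaving shard tasks from multiple models keeps every device busy, which is the intuition behind the claimed speedups.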