In many deep learning (DL) applications, the desire for ever-higher accuracy and the new ubiquity of transfer learning have led to a marked increase in the size and depth of model architectures. Thus, GPU memory capacity is often a bottleneck for DL practitioners. Existing techniques that partition the model architecture across a network of GPUs suffer from substantial underutilization and busy waiting due to the sequential dependencies in most large-scale model architectures (Transformers, CNNs). We observe that almost all such prior large-model systems focus on training only one model at a time, but in practice DL practitioners often train many models in bulk for model selection, e.g., hyper-parameter tuning, architecture fine-tuning, etc. This gap leads to significant system inefficiency. We approach this problem from first principles and propose a new information system architecture for scalable multi-model training that adapts and blends ideas from classical RDBMS design with task parallelism from the ML world. We propose a suite of techniques to optimize system efficiency holistically, including a highly general parameter-spilling design that enables large models to be trained even with a single GPU, a novel multi-query optimization scheme that blends model execution schedules efficiently and maximizes GPU utilization, and a double-buffering idea to hide latency. We prototype our ideas on top of PyTorch to build a system we call Hydra. Experiments with real benchmark large-scale multi-model DL workloads show that Hydra is over 7x faster than regular model parallelism and 1.8x-4.5x faster than state-of-the-art industrial tools for large-scale model training.