Scaling up model depth and size is now a common approach to raising accuracy in many deep learning (DL) applications, as evidenced by the widespread success of multi-billion- or even trillion-parameter models in natural language processing (NLP) research. Despite success in DL research and at major technology companies, broader practical adoption of such large models among domain scientists and businesses is still bottlenecked by GPU memory limits, high training costs, and low GPU availability, even on public clouds. Model selection needs further compound these resource challenges: users often need to compare dozens of models with different hyper-parameters or neural architectures to suit their specific task and dataset. In this paper, we present Hydra, a system designed to tackle such challenges by enabling out-of-the-box scaling for multi-large-model DL workloads, even on commodity GPUs, in a resource-efficient manner. Hydra is the first approach to holistically optimize the execution of multi-model workloads for large DL models. We do this by adapting prior "model-parallel" execution schemes to work with scalable parameter offloading across the memory hierarchy and further hybridizing this approach with task-parallel job scheduling techniques. Hydra decouples scalability of model parameters from parallelism of execution, thus enabling DL users to train even a 6-billion-parameter model on a single commodity GPU. It also fully exploits the speedup potential of task parallelism in multi-GPU setups, yielding near-linear strong scaling and making rigorous model selection more practical for such models. We evaluate end-to-end performance by fine-tuning GPT-2 for language modeling. We find that Hydra offers between 50% and 100% higher training throughput than even the best settings of state-of-the-art industrial frameworks such as DeepSpeed and GPipe for multi-large-model training.
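The core idea of decoupling model size from device memory can be illustrated with a minimal, hypothetical sketch (this is an illustration of parameter offloading in general, not Hydra's actual implementation): layer parameters live in host memory, and only one layer at a time is staged into a bounded "device" buffer during the forward pass, so peak device memory is one layer rather than the whole model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Host-resident parameters: many layers, but only one is "on device" at a time.
HIDDEN = 64
NUM_LAYERS = 8
host_weights = [rng.standard_normal((HIDDEN, HIDDEN)) / np.sqrt(HIDDEN)
                for _ in range(NUM_LAYERS)]

def forward_offloaded(x, host_weights):
    """Forward pass that stages one layer's weights at a time into a
    device buffer, mimicking parameter offloading across the memory
    hierarchy. Peak 'device' memory holds a single layer."""
    for w in host_weights:
        device_w = w.copy()                # stand-in for host-to-GPU transfer
        x = np.maximum(x @ device_w, 0.0)  # layer compute (ReLU MLP)
        del device_w                       # free the device buffer before the next layer
    return x

batch = rng.standard_normal((4, HIDDEN))
out = forward_offloaded(batch, host_weights)
print(out.shape)  # (4, 64)
```

In a real system the `w.copy()` step would be an asynchronous host-to-device transfer overlapped with compute; the sketch only shows why model size is bounded by host memory rather than device memory.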