Training large deep neural network models is highly challenging due to their tremendous computational and memory requirements. Blockwise distillation provides a promising method for faster convergence by splitting a large model into multiple smaller models. In state-of-the-art blockwise distillation methods, training is performed block-by-block in a data-parallel manner across multiple GPUs. To produce inputs for the student blocks, the teacher model is executed from its first block up to the block currently being trained. However, this results in high overhead from redundant teacher execution, low GPU utilization, and extra data loading. To address these problems, we propose Pipe-BD, a novel parallelization method for blockwise distillation. Pipe-BD aggressively exploits pipeline parallelism for blockwise distillation, eliminating redundant teacher block execution and increasing the per-device batch size for better resource utilization. We further extend Pipe-BD to hybrid parallelism for efficient workload balancing. As a result, Pipe-BD achieves significant acceleration without modifying the mathematical formulation of blockwise distillation. We implement Pipe-BD in PyTorch, and experiments show that it is effective across multiple scenarios, models, and datasets.
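For concreteness, the following is a minimal PyTorch sketch (not the paper's implementation) of the baseline blockwise-distillation loop described above, in which the teacher prefix is re-executed to produce inputs and targets for the student block being trained. The block partitioning, the MSE objective, and names such as `teacher_blocks` and `student_blocks` are illustrative assumptions.

```python
# Minimal sketch of baseline blockwise distillation, assuming teacher and
# student are split into the same number of sequential blocks (nn.ModuleList)
# and each student block is trained to mimic its teacher counterpart with MSE.
import torch
import torch.nn as nn

def train_block(block_idx, teacher_blocks, student_blocks, loader, epochs=1, lr=1e-3):
    """Train student block `block_idx` against teacher block `block_idx`."""
    student = student_blocks[block_idx]
    optimizer = torch.optim.SGD(student.parameters(), lr=lr)
    criterion = nn.MSELoss()
    for _ in range(epochs):
        for x, _ in loader:
            with torch.no_grad():
                # Redundant teacher execution: the entire teacher prefix is
                # re-run just to produce the input to the current block.
                h = x
                for t in teacher_blocks[:block_idx]:
                    h = t(h)
                target = teacher_blocks[block_idx](h)  # teacher output as target
            out = student(h)
            loss = criterion(out, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

Because each block is trained independently in this baseline, every device repeats the teacher prefix for its assigned block; this repeated prefix execution is the overhead that Pipe-BD's pipelined scheduling removes.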