In recent years, large pre-trained Transformer-based language models have led to dramatic improvements in many natural language understanding tasks. To train these increasingly large models, many practitioners increase the batch size so that training can be spread across multiple GPUs and run faster. However, increasing the batch size often makes optimization more difficult, leading to slow convergence or poor generalization that can require orders of magnitude more training time to reach the same model quality. In this paper, we explore the steepness of the loss landscape of large-batch optimization when adapting pre-trained Transformer-based language models to domain-specific tasks and find that it tends to be highly complex and irregular, which poses challenges to generalization on downstream tasks. To tackle this challenge, we propose ScaLA, a novel and efficient method for accelerating the adaptation of pre-trained Transformer networks. Unlike prior methods, we take a sequential game-theoretic approach, adding lightweight adversarial noise to large-batch optimization, which significantly improves adaptation speed while preserving model generalization. Experimental results show that ScaLA attains 2.7--9.8$\times$ adaptation speedups over the baseline for GLUE on BERT-base and RoBERTa-large, while achieving comparable and sometimes higher accuracy than state-of-the-art large-batch optimization methods. Finally, we also address the theoretical side of large-batch optimization with adversarial noise and provide a convergence rate analysis for ScaLA using techniques for analyzing non-convex saddle-point problems.
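
For orientation, training with adversarial noise is commonly written as a saddle-point (min-max) problem, which is the kind of problem the convergence analysis mentioned above refers to. The generic form below is only a sketch under assumed notation and is not claimed to be the exact ScaLA objective: the symbols $\theta$ (model parameters), $\delta$ (adversarial perturbation), $\epsilon$ (perturbation budget), $\ell$ (task loss), $f_{\theta}$ (the pre-trained network), and $\mathcal{D}$ (the downstream data distribution) are all introduced here for illustration and do not appear in the abstract.

```latex
% Generic adversarial-training objective (assumed notation, not taken from the paper):
% the inner player picks a bounded perturbation \delta that maximizes the loss,
% and the outer player picks parameters \theta that minimize the resulting worst-case loss.
\min_{\theta} \;
\mathbb{E}_{(x, y) \sim \mathcal{D}}
\Big[ \max_{\|\delta\| \le \epsilon} \ell\big(f_{\theta}(x + \delta),\, y\big) \Big]
```

Read as a sequential game, the inner maximization moves first by choosing the noise and the outer minimization then updates the model against it. How ScaLA keeps the inner step lightweight is not specified in the abstract and is left open here.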