Recent innovations in hardware (e.g., Nvidia A100) have motivated learning N:M structured sparsity masks from scratch for fast model inference. However, state-of-the-art learning recipes in this regime (e.g., SR-STE) were designed for non-adaptive optimizers such as momentum SGD, and they incur a non-trivial accuracy drop for Adam-trained models such as attention-based LLMs. In this paper, we first demonstrate that this gap originates from the poorly estimated second moment (i.e., variance) in the Adam states computed from the masked weights. We conjecture that learning N:M masks with Adam should take this critical regime of variance estimation into account. In light of this, we propose STEP, an Adam-aware recipe that learns N:M masks in two phases: first, STEP computes a reliable variance estimate (the precondition phase); subsequently, the variance is frozen and used as a precondition for learning the N:M masks (the mask-learning phase). STEP automatically identifies the switching point between the two phases by dynamically sampling variance changes over the training trajectory and testing the concentration of the samples. Empirically, we evaluate STEP and baselines such as ASP and SR-STE on multiple tasks, including CIFAR classification, machine translation, and LLM fine-tuning (BERT-Base, GPT-2). We show that STEP mitigates the accuracy drop of the baseline recipes and is robust to aggressive structured sparsity ratios.
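To make the two-phase idea concrete, the sketch below illustrates one plausible reading of the abstract: a switching test that checks whether the Adam second-moment estimate has stabilized, and an N:M masking step that uses the frozen variance as a precondition. The helper names, the relative-change concentration test, and the |w|/sqrt(v) scoring rule are illustrative assumptions, not the authors' actual STEP implementation.

```python
# Hedged sketch of the two phases described in the abstract.
# `variance_is_concentrated` and `apply_nm_mask` are hypothetical helpers;
# the thresholds and the scoring rule are assumptions for illustration only.
import torch

def variance_is_concentrated(var_history, window=10, tol=1e-3):
    """Assumed switching test: declare the Adam second-moment estimate
    reliable once its relative change over a recent window stays small."""
    if len(var_history) < window + 1:
        return False
    recent = var_history[-(window + 1):]
    rel_changes = [
        (torch.norm(b - a) / (torch.norm(a) + 1e-12)).item()
        for a, b in zip(recent[:-1], recent[1:])
    ]
    return max(rel_changes) < tol

def apply_nm_mask(weight, frozen_var, n=2, m=4):
    """Illustrative N:M masking: within each group of m consecutive weights,
    keep the n entries with the largest variance-preconditioned magnitude
    |w| / sqrt(v + eps), and zero out the rest."""
    score = weight.detach().abs() / (frozen_var + 1e-8).sqrt()
    grouped = score.reshape(-1, m)                  # one row per group of m
    keep_idx = grouped.topk(n, dim=1).indices       # n survivors per group
    mask = torch.zeros_like(grouped)
    mask.scatter_(1, keep_idx, 1.0)
    return mask.reshape_as(weight)

# Usage sketch: during the precondition phase, snapshot the optimizer's
# exp_avg_sq state each step; once variance_is_concentrated(...) fires,
# freeze that estimate and apply apply_nm_mask(...) to the weights at every
# subsequent step of the mask-learning phase.
```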