The core of self-supervised learning for pre-training language models includes pre-training task design as well as appropriate data augmentation. Most data augmentations in language model pre-training are context-independent. A seminal contextualized augmentation was recently proposed in ELECTRA, which achieved state-of-the-art performance by introducing an auxiliary generation network (generator) to produce contextualized data augmentation for training a main discrimination network (discriminator). This design, however, introduces the extra computation cost of the generator and the need to balance the relative capabilities of the generator and the discriminator. In this paper, we propose a self-augmentation strategy (SAS) in which a single network is used for both regular pre-training and contextualized data augmentation for training in later epochs. Essentially, this strategy eliminates the separate generator and uses the single network to jointly conduct two pre-training tasks via MLM (Masked Language Modeling) and RTD (Replaced Token Detection) heads. It avoids the challenge of searching for an appropriate generator size, which is critical to performance, as evidenced by ELECTRA and its subsequent variant models. In addition, SAS is a general strategy that can be seamlessly combined with many techniques emerging recently or in the future, such as the disentangled attention mechanism from DeBERTa. Our experiments show that SAS outperforms ELECTRA and other state-of-the-art models on the GLUE tasks at similar or lower computation cost.
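To make the single-network design concrete, the following is a minimal sketch of how a shared encoder with MLM and RTD heads could perform self-augmentation; the class and function names (SASModel, self_augment), the encoder call signature, and the sampling scheme are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SASModel(nn.Module):
    """Sketch: one shared encoder carrying both an MLM head and an RTD head."""

    def __init__(self, encoder, hidden_size, vocab_size):
        super().__init__()
        self.encoder = encoder                               # shared Transformer encoder (assumed interface)
        self.mlm_head = nn.Linear(hidden_size, vocab_size)   # predicts original tokens at masked positions
        self.rtd_head = nn.Linear(hidden_size, 1)            # per-token logit: replaced vs. original

    def forward(self, input_ids, attention_mask):
        # assumed: encoder returns hidden states of shape (batch, seq_len, hidden_size)
        hidden = self.encoder(input_ids, attention_mask)
        return self.mlm_head(hidden), self.rtd_head(hidden).squeeze(-1)


def self_augment(model, masked_ids, attention_mask, mask_positions):
    """Use the model's own MLM predictions to build a replaced-token input for RTD training."""
    with torch.no_grad():
        mlm_logits, _ = model(masked_ids, attention_mask)
        # sample plausible replacements from the model's own distribution
        sampled = torch.distributions.Categorical(logits=mlm_logits).sample()
    augmented_ids = masked_ids.clone()
    # substitute only the masked positions with the sampled tokens
    augmented_ids[mask_positions] = sampled[mask_positions]
    return augmented_ids
```

In later epochs, the same network would be trained with the MLM loss on the masked input and the RTD loss on the self-augmented input, so no separately sized generator is needed.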