The core of a self-supervised learning method for pre-training language models is the design of appropriate data augmentation and corresponding pre-training task(s). Most data augmentations used in language model pre-training are context-independent. The seminal contextualized augmentation recently proposed by ELECTRA requires a separate generator, which incurs extra computation cost and poses the challenge of adjusting the generator's capability relative to that of the other model component(s). We propose a self-augmentation strategy (SAS) that uses a single forward pass through the model to augment the input data for model training in the next epoch. Essentially, our strategy eliminates the separate generator network and uses only one network to both generate the data augmentation and undertake two pre-training tasks (the MLM task and the RTD task) jointly, which naturally avoids the challenge of adjusting the generator's capability and reduces the computation cost. Moreover, SAS is a general strategy that can seamlessly incorporate many new techniques emerging now or in the future, such as the disentangled attention mechanism recently proposed by the DeBERTa model. Our experiments show that SAS outperforms ELECTRA and other state-of-the-art models on the GLUE tasks with the same or less computation cost.
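To make the joint objective concrete, the following is a minimal sketch (not the authors' implementation) of a single SAS-style training step: one network receives inputs whose replaced tokens were sampled from its own MLM predictions in the previous epoch, and is trained on the MLM and RTD losses jointly while producing the replacements for the next epoch in the same forward pass. All names (`SASModel`, `sas_step`, the small encoder configuration) and the RTD loss weight are illustrative assumptions for the sketch, not values from the paper.

```python
# Minimal sketch of the SAS idea under the assumptions stated above:
# one encoder, two heads (MLM + RTD), and self-generated replacements
# reused as the data augmentation for the next epoch.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SASModel(nn.Module):
    def __init__(self, vocab_size=30522, hidden=256, layers=4, heads=4, max_len=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, hidden)
        self.pos_emb = nn.Embedding(max_len, hidden)
        enc_layer = nn.TransformerEncoderLayer(hidden, heads, 4 * hidden, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, layers)
        self.mlm_head = nn.Linear(hidden, vocab_size)  # recovers original tokens (MLM task)
        self.rtd_head = nn.Linear(hidden, 1)           # detects replaced tokens (RTD task)

    def forward(self, input_ids):
        pos = torch.arange(input_ids.size(1), device=input_ids.device)
        h = self.encoder(self.tok_emb(input_ids) + self.pos_emb(pos))
        return self.mlm_head(h), self.rtd_head(h).squeeze(-1)


def sas_step(model, original_ids, corrupted_ids, mlm_mask, rtd_labels, lambda_rtd=50.0):
    """One joint MLM+RTD step on self-augmented inputs.

    corrupted_ids: original_ids with some positions replaced by tokens the model
                   itself sampled during the previous epoch (no separate generator).
    mlm_mask:      bool mask of positions where the MLM head must recover the original token.
    rtd_labels:    1.0 where a token was replaced, 0.0 where it is original.
    """
    mlm_logits, rtd_logits = model(corrupted_ids)
    mlm_loss = F.cross_entropy(mlm_logits[mlm_mask], original_ids[mlm_mask])
    rtd_loss = F.binary_cross_entropy_with_logits(rtd_logits, rtd_labels)
    loss = mlm_loss + lambda_rtd * rtd_loss

    # Self-augmentation: the same forward pass yields sampled tokens that serve
    # as the replacements when this example is revisited in the next epoch.
    with torch.no_grad():
        next_epoch_replacements = torch.multinomial(
            F.softmax(mlm_logits[mlm_mask], dim=-1), num_samples=1).squeeze(-1)
    return loss, next_epoch_replacements
```

The key design point the sketch tries to convey is that the replacement tokens are a by-product of the forward pass already needed for training, so no generator network, and no tuning of a generator's relative capability, is required.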