The Transformer architecture has profoundly changed natural language processing, outperforming all previous state-of-the-art models. However, well-known Transformer models such as BERT, RoBERTa, and GPT-2 require a huge compute budget to build high-quality contextualised representations. In this paper, we study several efficient pre-training objectives for Transformer-based models. By testing these objectives on different tasks, we determine which of the ELECTRA model's new features is the most relevant. We confirm that Transformer pre-training improves when the input does not contain masked tokens and that computing the loss over the whole output reduces training time. Moreover, inspired by ELECTRA, we study a model composed of two blocks: a discriminator and a simple generator based on a statistical model with no impact on computational cost. In addition, we show that eliminating the MASK token and considering the whole output during loss computation are essential choices for improving performance. Furthermore, we show that it is possible to efficiently train BERT-like models with a discriminative approach, as in ELECTRA, but without a complex and expensive generator. Finally, we show that ELECTRA benefits heavily from a state-of-the-art hyper-parameter search.
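To make the discriminative setup concrete, the following is a minimal sketch (not the authors' code) of ELECTRA-style replaced-token detection in which the learned generator is replaced by a purely statistical one: corrupted positions are filled by sampling from a unigram token distribution, and the binary loss is computed over every output position. All names, sizes, and the placeholder unigram distribution are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch: replaced-token detection with a statistical generator.
import torch
import torch.nn as nn

vocab_size, hidden, seq_len, batch = 30522, 256, 128, 8

# Statistical "generator": unigram token frequencies (uniform here as a placeholder;
# in practice these would be estimated from the pre-training corpus).
unigram_probs = torch.full((vocab_size,), 1.0 / vocab_size)

def corrupt(tokens: torch.Tensor, replace_prob: float = 0.15):
    """Replace ~15% of tokens with samples from the unigram distribution."""
    is_replaced = torch.rand_like(tokens, dtype=torch.float) < replace_prob
    sampled = torch.multinomial(unigram_probs, tokens.numel(), replacement=True)
    corrupted = torch.where(is_replaced, sampled.view_as(tokens), tokens)
    # A sampled token may coincide with the original; those positions count as "original".
    labels = (corrupted != tokens).float()
    return corrupted, labels

class Discriminator(nn.Module):
    """BERT-like encoder with a per-token binary head (original vs. replaced)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        layer = nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(hidden, 1)

    def forward(self, tokens):
        return self.head(self.encoder(self.embed(tokens))).squeeze(-1)

model = Discriminator()
tokens = torch.randint(0, vocab_size, (batch, seq_len))
corrupted, labels = corrupt(tokens)
logits = model(corrupted)
# Loss over *all* positions, not only the corrupted ones.
loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
loss.backward()
```

Because the generator is a fixed sampling step rather than a trained masked language model, the corruption adds essentially no computational overhead to each training step.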