With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling. However, relying on corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy. In light of these pros and cons, we propose XLNet, a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining. Empirically, XLNet outperforms BERT on 20 tasks, often by a large margin, and achieves state-of-the-art results on 18 tasks including question answering, natural language inference, sentiment analysis, and document ranking.
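For concreteness, the permutation-based objective referenced in (1) can be sketched as follows. This is a shorthand formalization, not text quoted from this section: let $\mathcal{Z}_T$ denote the set of all permutations of the index sequence $[1, 2, \ldots, T]$, let $z_t$ and $\mathbf{z}_{<t}$ denote the $t$-th element and the first $t-1$ elements of a permutation $\mathbf{z} \in \mathcal{Z}_T$, and let $\theta$ be the model parameters shared across all factorization orders. The pretraining objective maximizes the expected autoregressive log-likelihood over sampled factorization orders:

$$\max_{\theta} \;\; \mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T} \left[ \sum_{t=1}^{T} \log p_{\theta}\!\left(x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}}\right) \right]$$

Because each token is, in expectation, conditioned on every other position across different factorization orders, the model learns bidirectional context while keeping the standard autoregressive product-rule factorization, avoiding the [MASK] corruption that causes BERT's pretrain-finetune discrepancy.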