Inductive transfer learning has taken the entire NLP field by storm, with models such as BERT and BART setting new state of the art on countless NLU tasks. However, most of the available models and research have been conducted for the English language. In this work, we introduce BARThez, the first large-scale pretrained seq2seq model for French. Being based on BART, BARThez is particularly well-suited for generative tasks. We evaluate BARThez on five discriminative tasks from the FLUE benchmark and two generative tasks from OrangeSum, a novel summarization dataset that we created for this research. We show BARThez to be very competitive with state-of-the-art BERT-based French language models such as CamemBERT and FlauBERT. We also continue the pretraining of a multilingual BART on BARThez's corpus, and show that our resulting model, mBARThez, significantly boosts BARThez's generative performance. Code, data and models are publicly available.