Language models have proven to be very useful when adapted to specific domains. Nonetheless, little research has been done on adapting domain-specific BERT models to French. In this paper, we focus on creating a language model adapted to French legal text with the goal of helping law professionals. We conclude that some specific tasks do not benefit from generic language models pre-trained on large amounts of data. We explore the use of smaller architectures for domain-specific sub-languages and their benefits for French legal text. We show that domain-specific pre-trained models can perform better than their generic equivalents in the legal domain. Finally, we release JuriBERT, a new set of BERT models adapted to the French legal domain.