The application of Natural Language Processing (NLP) to specialized domains, such as the law, has recently received a surge of interest. As many legal services rely on processing and analyzing large collections of documents, automating such tasks with NLP tools emerges as a key challenge. Many popular language models, such as BERT or RoBERTa, are general-purpose models, which limits their ability to handle specialized legal terminology and syntax. In addition, legal documents may contain specialized vocabulary from other domains, such as medical terminology in personal injury text. Here, we propose LegalRelectra, a legal-domain language model trained on mixed-domain legal and medical corpora. We show that our model improves over general-domain and single-domain medical and legal language models when processing mixed-domain (personal injury) text. Our training architecture implements the Electra framework but utilizes Reformer instead of BERT for its generator and discriminator. We show that this improves the model's performance on long passages and results in better long-range text comprehension.
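To make the training setup concrete, the following is a minimal sketch of an Electra-style pretraining objective, in which a generator fills in masked tokens and a discriminator predicts, per token, whether each token was replaced. It is not the paper's implementation: an off-the-shelf nn.TransformerEncoder stands in for the Reformer generator and discriminator described in the abstract, and the vocabulary size, mask token id, masking rate, and loss weighting are illustrative assumptions rather than the authors' configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE = 30522   # assumption: BERT-style vocabulary size
MASK_ID = 103        # assumption: [MASK] token id
HIDDEN = 256         # assumption: small hidden size for illustration


class Encoder(nn.Module):
    """Stand-in encoder; the paper uses Reformer blocks in this role."""
    def __init__(self, hidden=HIDDEN, layers=2):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, hidden)
        layer = nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, ids):
        return self.encoder(self.embed(ids))


class ElectraStylePretrainer(nn.Module):
    """Generator proposes replacements for masked tokens; the
    discriminator classifies every token as original or replaced."""
    def __init__(self):
        super().__init__()
        self.generator = Encoder()
        self.gen_head = nn.Linear(HIDDEN, VOCAB_SIZE)  # masked-LM head
        self.discriminator = Encoder()
        self.disc_head = nn.Linear(HIDDEN, 1)           # replaced-token head

    def forward(self, input_ids, mask_prob=0.15):
        # 1) Randomly mask a fraction of the input tokens.
        mask = torch.rand(input_ids.shape) < mask_prob
        masked = input_ids.masked_fill(mask, MASK_ID)

        # 2) Generator: masked-LM loss on the masked positions.
        gen_logits = self.gen_head(self.generator(masked))
        mlm_loss = F.cross_entropy(gen_logits[mask], input_ids[mask])

        # 3) Sample replacements from the generator; no gradient flows
        #    from the discriminator back into the generator.
        with torch.no_grad():
            sampled = torch.distributions.Categorical(logits=gen_logits).sample()
        corrupted = torch.where(mask, sampled, input_ids)
        is_replaced = (corrupted != input_ids).float()

        # 4) Discriminator: per-token replaced/original classification.
        disc_logits = self.disc_head(self.discriminator(corrupted)).squeeze(-1)
        rtd_loss = F.binary_cross_entropy_with_logits(disc_logits, is_replaced)

        # Electra combines the two losses with a weighting factor
        # (50.0 here is the weight used in the original Electra paper).
        return mlm_loss + 50.0 * rtd_loss


# Usage: one forward/backward pass on random token ids.
model = ElectraStylePretrainer()
ids = torch.randint(0, VOCAB_SIZE, (2, 128))
loss = model(ids)
loss.backward()
```

Swapping the stand-in encoder for Reformer blocks is what gives the model its efficiency on long passages, since Reformer's LSH attention scales better with sequence length than full self-attention.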