Brazilian Court Documents Clustered by Similarity Together Using Natural Language Processing Approaches with Transformers

Recent advances in artificial intelligence (AI) have produced promising results on complex problems in Natural Language Processing (NLP), making it an important tool for expediting the resolution of judicial proceedings. In this context, this work addresses the problem of measuring the degree of similarity that can be achieved among judicial documents within inferred groups, by applying six transformer-based NLP techniques: BERT, GPT-2 and RoBERTa models pre-trained on Brazilian Portuguese, and the same three models further specialized using 210,000 legal proceedings. Documents were pre-processed and their content was transformed into vector representations using these techniques. Unsupervised learning was used to cluster the lawsuits, and model quality was calculated from the cosine distance between the elements of each group and its centroid. We observed that transformer-based models perform better than those in previous research, most notably the RoBERTa model specialized in the Brazilian Portuguese language, which makes it possible to advance the current state of the art in NLP applied to the legal sector.
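
The cluster-quality measure described above can be illustrated with a minimal sketch. It assumes that document embeddings and cluster labels are already available; the function names (`cosine_distance`, `centroid`, `cluster_quality`) and the toy two-dimensional vectors are hypothetical stand-ins, not the authors' implementation, and the clustering algorithm itself is left out because the abstract does not name it.

```python
import math
from collections import defaultdict


def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cluster_quality(embeddings, labels):
    """Mean cosine distance from each embedding to its cluster centroid.

    Returns (per_cluster, overall): a dict mapping each label to the mean
    distance inside that cluster, and the mean over all documents.
    """
    groups = defaultdict(list)
    for vec, label in zip(embeddings, labels):
        groups[label].append(vec)

    per_cluster = {}
    total = 0.0
    for label, members in groups.items():
        center = centroid(members)
        distances = [cosine_distance(m, center) for m in members]
        per_cluster[label] = sum(distances) / len(distances)
        total += sum(distances)
    overall = total / len(embeddings)
    return per_cluster, overall


# Toy 2-D vectors standing in for transformer embeddings (assumption).
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
good_labels = [0, 0, 1, 1]   # similar documents grouped together
bad_labels = [0, 1, 0, 1]    # dissimilar documents grouped together

good_per_cluster, good_overall = cluster_quality(embeddings, good_labels)
bad_per_cluster, bad_overall = cluster_quality(embeddings, bad_labels)
print("good grouping:", good_overall)
print("bad grouping:", bad_overall)
```

Under this measure, lower values indicate tighter groups, so a model whose embeddings yield a smaller mean distance to the centroid places similar lawsuits closer together.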
