Pre-trained Transformers currently dominate most NLP tasks. They impose, however, limits on the maximum input length (512 sub-words in BERT), which are too restrictive in the legal domain. Even sparse-attention models, such as Longformer and BigBird, which increase the maximum input length to 4,096 sub-words, severely truncate texts in three of the six datasets of LexGLUE. Simpler linear classifiers with TF-IDF features can handle texts of any length and require far fewer resources to train and deploy, but they are usually outperformed by pre-trained Transformers. We explore two directions to cope with long legal texts: (i) modifying a Longformer warm-started from LegalBERT to handle even longer texts (up to 8,192 sub-words), and (ii) modifying LegalBERT to use TF-IDF representations. The first approach is the best in terms of performance, surpassing a hierarchical version of LegalBERT, which was the previous state of the art in LexGLUE. The second approach leads to computationally more efficient models at the expense of lower performance, but overall the resulting models still outperform a linear SVM with TF-IDF features in long legal document classification.
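For illustration, below is a minimal sketch (using the Hugging Face transformers and PyTorch APIs) of how a Longformer-style encoder might be warm-started from LegalBERT and stretched to 8,192 positions. The checkpoint name nlpaueb/legal-bert-base-uncased, the 128-token attention window, the name-and-shape parameter matching, and the position-embedding tiling strategy are illustrative assumptions, not the authors' exact recipe.

```python
from transformers import AutoModel, LongformerConfig, LongformerModel

# Load LegalBERT (512-position BERT-style encoder) as the warm-start source.
src = AutoModel.from_pretrained("nlpaueb/legal-bert-base-uncased")

max_len = 8192
config = LongformerConfig(
    vocab_size=src.config.vocab_size,
    hidden_size=src.config.hidden_size,
    num_hidden_layers=src.config.num_hidden_layers,
    num_attention_heads=src.config.num_attention_heads,
    intermediate_size=src.config.intermediate_size,
    type_vocab_size=src.config.type_vocab_size,
    max_position_embeddings=max_len + 2,  # RoBERTa-style models reserve extra positions for padding offsets
    attention_window=[128] * src.config.num_hidden_layers,  # sliding-window width (assumption)
)
tgt = LongformerModel(config)

src_state = src.state_dict()
tgt_state = tgt.state_dict()

# Copy every LegalBERT parameter whose name and shape match (embeddings, local
# attention projections, feed-forward layers, layer norms).
for name, param in tgt_state.items():
    if name in src_state and src_state[name].shape == param.shape:
        tgt_state[name] = src_state[name].clone()

# Longformer's global-attention projections have no BERT counterpart; a common
# warm-start is to copy them from the corresponding local projections.
for name in tgt_state:
    if "_global" in name:
        local = name.replace("_global", "")
        if local in src_state and src_state[local].shape == tgt_state[name].shape:
            tgt_state[name] = src_state[local].clone()

# Warm-start the long position-embedding table by tiling LegalBERT's 512 learned
# position vectors across the 8,192+ positions.
old_pos = src_state["embeddings.position_embeddings.weight"]
new_pos = tgt_state["embeddings.position_embeddings.weight"].clone()
for start in range(0, new_pos.size(0), old_pos.size(0)):
    end = min(start + old_pos.size(0), new_pos.size(0))
    new_pos[start:end] = old_pos[: end - start]
tgt_state["embeddings.position_embeddings.weight"] = new_pos

tgt.load_state_dict(tgt_state)
```

Tiling the learned 512-position vectors, rather than initializing the longer table randomly, lets the extended model start from informative position representations, in the spirit of the warm-starting idea behind Longformer; the exact initialization used in the paper may differ.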