Large pre-trained transformer models such as BERT have achieved state-of-the-art results on document understanding tasks, but most implementations can consider only 512 tokens at a time. In many real-world applications, documents are much longer, and the segmentation strategies typically applied to them lose document structure and contextual information, which hurts results on downstream tasks. In our work on legal agreements, we find that visual cues such as the layout, style, and placement of text in a document are strong features that are crucial to achieving an acceptable level of accuracy on long documents. We measure the impact of incorporating such visual cues, obtained via computer vision methods, on the accuracy of document understanding tasks including document segmentation, entity extraction, and attribute classification. Our method of segmenting documents based on structural metadata outperforms existing methods on four long-document understanding tasks, as measured on the Contract Understanding Atticus Dataset.
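To make the contrast between fixed-window segmentation and structure-based segmentation concrete, here is a minimal Python sketch. It is not the authors' method: it assumes the document has already been divided into structural blocks (for example, sections found by a layout-detection step) and simply packs whole blocks into segments of at most 512 tokens. The function names `fixed_windows` and `structural_segments`, and the representation of tokens as plain strings, are hypothetical choices made for illustration.

```python
from typing import List

MAX_TOKENS = 512  # window size cited in the text


def fixed_windows(tokens: List[str], size: int = MAX_TOKENS) -> List[List[str]]:
    """Baseline: cut the token stream every `size` tokens, ignoring structure."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def structural_segments(blocks: List[List[str]], size: int = MAX_TOKENS) -> List[List[str]]:
    """Greedily pack whole blocks into segments of at most `size` tokens.

    `blocks` is assumed to come from an upstream structure detector
    (hypothetical here). A single block longer than `size` falls back
    to fixed windows.
    """
    segments: List[List[str]] = []
    current: List[str] = []
    for block in blocks:
        if len(block) > size:
            # Oversized block: flush what we have, then split it naively.
            if current:
                segments.append(current)
                current = []
            segments.extend(fixed_windows(block, size))
            continue
        if len(current) + len(block) > size:
            # Adding this block would overflow, so close the segment first.
            segments.append(current)
            current = []
        current = current + block
    if current:
        segments.append(current)
    return segments


# Three toy "sections" of 300, 300, and 100 tokens.
blocks = [["a"] * 300, ["b"] * 300, ["c"] * 100]
flat = [tok for block in blocks for tok in block]

print([len(w) for w in fixed_windows(flat)])          # [512, 188]
print([len(s) for s in structural_segments(blocks)])  # [300, 400]
```

In this toy input, the fixed-window baseline cuts the second section in two, while the block-aware version keeps every section intact within the same token budget.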