The automation of document processing has recently gained attention owing to its great potential to reduce manual work through improved methods and hardware. Neural networks have been applied successfully before, even though they have so far been trained only on relatively small datasets of hundreds of documents. To successfully explore deep-learning techniques and improve information extraction results, a dataset of more than twenty-five thousand documents has been compiled, anonymized, and published as part of this work. We expand on our previous work, in which we showed that convolutions, graph convolutions, and self-attention can work together and exploit all the information present in a structured document. Taking the fully trainable method one step further, we now design and examine various approaches based on siamese networks, concepts of similarity, one-shot learning, and context/memory awareness. The aim is to improve the micro F1 score of per-word classification on this large real-world document dataset. The results verify the hypothesis that trainable access to a similar (yet still different) page, together with its already known target information, improves information extraction. Furthermore, the experiments confirm that all proposed architectural components are required to beat the previous results. The best model improves on the previous state of the art by a gain of 8.25 in F1 score. A qualitative analysis verifies that the new model performs better for all target classes. In addition, multiple structural observations about the causes of the underperformance of some architectures are reported. All source code, parameters, and implementation details are published together with the dataset in the hope of pushing the research boundaries, since none of the techniques used in this work is problem-specific and they can be generalized to other tasks and contexts.
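The reported metric is the micro F1 score of per-word classification. As a minimal sketch of how such a score can be computed, counts are pooled globally over classes rather than averaged per class; the background label `"O"` and its exact treatment here are illustrative assumptions, not details taken from this work:

```python
def micro_f1(true_labels, pred_labels, background="O"):
    """Micro-averaged F1 for per-word classification.

    True positives, false positives, and false negatives are pooled
    across all non-background classes. A word whose true class is
    predicted as background counts as a false negative; a background
    word predicted as a real class counts as a false positive.
    (Background label "O" is an assumption for illustration.)
    """
    tp = fp = fn = 0
    for t, p in zip(true_labels, pred_labels):
        if p != background and p == t:
            tp += 1  # correctly recovered target word
        if p != background and p != t:
            fp += 1  # spurious prediction
        if t != background and p != t:
            fn += 1  # missed target word
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With this pooling, a gain such as the 8.25 reported above reflects improvement aggregated over every word of every class at once, so rare classes cannot dominate the score.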