Document Information Extraction (DIE) has attracted increasing attention due to its wide range of advanced real-world applications. Although recent literature has achieved competitive results, these approaches usually fail on complex documents with noisy OCR results or variable layouts. To address these problems, this paper proposes the Generative Multi-modal Network (GMN), a robust multi-modal generation method for real-world scenarios that requires no predefined label categories. With a carefully designed spatial encoder and modal-aware mask module, GMN can handle complex documents that are hard to serialize into a sequential order. Moreover, GMN tolerates errors in OCR results and requires no character-level annotation, which is vital because fine-grained annotation of numerous documents is laborious and may even require annotators with specialized domain knowledge. Extensive experiments show that GMN achieves new state-of-the-art performance on several public DIE datasets and surpasses other methods by a large margin, especially in realistic scenes.