Information Extraction from Documents: Question Answering vs Token Classification in real-world setups

Research in Document Intelligence, and especially in Document Key Information Extraction (DocKIE), has mainly been approached as a token classification problem. Recent breakthroughs in both natural language processing (NLP) and computer vision have helped build document-focused pre-training methods that leverage a multimodal understanding of the document's text, layout and image modalities. However, these breakthroughs have also led to the emergence of a new DocKIE subtask, extractive document Question Answering (DocQA), as part of the Machine Reading Comprehension (MRC) research field. In this work, we compare the Question Answering (QA) approach with the classical token classification approach for document key information extraction. We designed experiments to benchmark the two approaches in five different setups: raw performance, robustness to noisy environments, capacity to extract long entities, fine-tuning speed in few-shot learning and, finally, zero-shot learning. Our research showed that, when dealing with clean and relatively short entities, a token classification-based approach is still the best choice, while the QA approach can be a good alternative for noisy environments or long-entity use cases.
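To make the difference between the two formulations concrete, the following is a minimal sketch of how each one turns model outputs into an extracted field. It is not the implementation used in this work: the function names `decode_bio` and `decode_qa_span`, the BIO label names, the `max_len` parameter and the toy scores are all assumptions made for illustration. Token classification assigns one label per token and groups labelled tokens into entities, whereas extractive QA scores candidate start and end positions for one question at a time and returns the best span.

```python
from typing import Dict, List, Optional


def decode_bio(tokens: List[str], tags: List[str]) -> Dict[str, List[str]]:
    """Token classification view: group tokens into entities from BIO tags."""
    entities: Dict[str, List[str]] = {}
    label: Optional[str] = None
    span: List[str] = []

    def flush() -> None:
        # Store the entity currently being built, if any.
        if label is not None:
            entities.setdefault(label, []).append(" ".join(span))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            label, span = tag[2:], [token]
        elif tag.startswith("I-") and label == tag[2:]:
            span.append(token)
        else:
            # "O" tag, or an "I-" tag that does not continue the open entity.
            flush()
            label, span = None, []
    flush()
    return entities


def decode_qa_span(tokens: List[str], start_scores: List[float],
                   end_scores: List[float], max_len: int = 30) -> str:
    """Extractive QA view: pick the (start, end) pair with the highest summed score."""
    best_score, best_i, best_j = float("-inf"), 0, 0
    for i, s in enumerate(start_scores):
        # Only consider spans that end at or after the start, up to max_len tokens.
        for j in range(i, min(i + max_len, len(tokens))):
            if s + end_scores[j] > best_score:
                best_score, best_i, best_j = s + end_scores[j], i, j
    return " ".join(tokens[best_i:best_j + 1])


# Toy document (made-up tokens, labels and scores).
tokens = ["Invoice", "No", "12345", "Total", "42.00", "EUR"]

# Token classification: one pass labels every token, all fields come out at once.
tags = ["O", "O", "B-INVOICE_ID", "O", "B-TOTAL", "I-TOTAL"]
fields = decode_bio(tokens, tags)

# QA: one pass per question, e.g. "What is the total?", yields a single span.
start_scores = [0.1, 0.0, 0.2, 0.3, 2.5, 0.1]
end_scores = [0.0, 0.1, 0.3, 0.2, 0.4, 2.0]
total = decode_qa_span(tokens, start_scores, end_scores)
```

In this sketch, a single mislabelled token in the middle of a long entity splits or truncates it under BIO decoding, while the QA decoder only needs the two boundary positions to be right. That is one plausible intuition for the long-entity setup mentioned above, not a result established by the code.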
