Many business workflows require extracting important fields from form-like documents such as bank statements, bills of lading, and purchase orders. Recent techniques for automating this task work well only when trained on large datasets. In this work, we propose a novel data augmentation technique that improves performance when training data is scarce (e.g., 10-250 documents). Our technique, which we call FieldSwap, replaces the key phrases of a source field with the key phrases of a target field, generating new synthetic examples of the target field for use in training. We demonstrate that this approach can yield improvements of 1-7 F1 points in extraction performance.
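The swapping idea can be illustrated with a minimal sketch. It is not the authors' implementation: the `Example` record, the `field_swap` function, the field names, and the hand-written key-phrase lists are all hypothetical, and the abstract does not say how key phrases are identified or how source and target fields are paired. The sketch also assumes that a key phrase appears verbatim in the text and does not overlap the annotated value.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Example:
    """A hypothetical training example: one snippet of a form-like document."""
    text: str    # snippet containing a key phrase and a value
    field: str   # label of the field whose value appears in the snippet
    value: str   # the annotated value


def field_swap(
    example: Example,
    source_phrases: List[str],
    target_field: str,
    target_phrases: List[str],
) -> List[Example]:
    """Create synthetic target-field examples from a source-field example.

    Each source key phrase found in the text is replaced by each target key
    phrase, and the result is relabeled as the target field. The value is
    left unchanged (assumption: the two fields share a value type).
    """
    synthetic = []
    for src in source_phrases:
        if src not in example.text:
            continue  # this key phrase does not occur in the snippet
        for tgt in target_phrases:
            synthetic.append(
                Example(
                    text=example.text.replace(src, tgt),
                    field=target_field,
                    value=example.value,
                )
            )
    return synthetic


# Usage: turn one "invoice_date" example into two "due_date" examples.
original = Example("Invoice Date: 2021-03-04", "invoice_date", "2021-03-04")
synthetic = field_swap(
    original,
    source_phrases=["Invoice Date"],
    target_field="due_date",
    target_phrases=["Due Date", "Payment Due"],
)
for ex in synthetic:
    print(ex)
```

In this sketch the swap is only sensible between fields whose values have the same type (here, two dates); swapping between, say, a date field and an amount field would produce misleading synthetic examples.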