Business documents come in a variety of structures and formats and serve varied information needs, which makes information extraction a challenging task. Given this variation, a document-generic model that works well across all document types and all use cases seems far-fetched. Document-specific models, in turn, require customized document-specific labels. We introduce DoSA (Document Specific Automated Annotations), which helps annotators generate initial annotations automatically using our novel bootstrap approach, which leverages document-generic datasets and models. These initial annotations can then be reviewed by a human for correctness. An initial document-specific model can then be trained, and its inference can be used as feedback for generating more automated annotations. These automated annotations can again be reviewed by a human in the loop for correctness, and a new, improved model can be trained using the current model as the pre-trained model before moving to the next iteration. In this paper, our scope is limited to form-like documents because of the limited availability of generic annotated datasets, but the idea can be extended to a variety of other documents as more datasets are built. An open-source, ready-to-use implementation is available on GitHub: https://github.com/neeleshkshukla/DoSA.
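The iterative loop described above can be sketched roughly as follows. This is a minimal illustration and not the DoSA implementation: the function `bootstrap_annotations`, the toy helpers, and the `question`/`answer`/`other` labels are all hypothetical names chosen for the example, and the "human review" step is simulated by a simple rule.

```python
from typing import Callable, Dict, List, Optional, Tuple

Annotation = Dict[str, str]           # token -> label (hypothetical format)
Model = Callable[[str], Annotation]   # a model maps a document to annotations
Example = Tuple[str, Annotation]      # (document, reviewed annotation)


def bootstrap_annotations(
    documents: List[str],
    generic_model: Model,
    review: Callable[[str, Annotation], Annotation],
    train: Callable[[List[Example], Optional[Model]], Model],
    iterations: int = 2,
) -> Tuple[Model, List[Example]]:
    """Hypothetical sketch of an annotate -> review -> train loop."""
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    model: Optional[Model] = None
    reviewed: List[Example] = []
    annotator = generic_model  # the first round uses the document-generic model
    for _ in range(iterations):
        # 1. Generate automated annotations with the current annotator.
        proposals = [(doc, annotator(doc)) for doc in documents]
        # 2. A human in the loop corrects the proposals.
        reviewed = [(doc, review(doc, ann)) for doc, ann in proposals]
        # 3. Train a new model, passing the previous one as the starting point.
        model = train(reviewed, model)
        # 4. The new document-specific model annotates the next round.
        annotator = model
    assert model is not None
    return model, reviewed


# Toy stand-ins (assumptions for illustration only).
def toy_generic_model(doc: str) -> Annotation:
    # Labels tokens ending in ":" as questions, everything else as other.
    return {t: ("question" if t.endswith(":") else "other") for t in doc.split()}


def toy_review(doc: str, ann: Annotation) -> Annotation:
    # Simulated reviewer: the token right after a question is an answer.
    tokens = doc.split()
    fixed = dict(ann)
    for prev, tok in zip(tokens, tokens[1:]):
        if prev.endswith(":"):
            fixed[tok] = "answer"
    return fixed


def toy_train(examples: List[Example], previous: Optional[Model]) -> Model:
    # "Training" here just memorises reviewed token labels; `previous` is
    # accepted to mirror warm-starting but is unused by this toy.
    memory: Annotation = {}
    for _, ann in examples:
        memory.update(ann)
    return lambda doc: {t: memory.get(t, "other") for t in doc.split()}


model, reviewed = bootstrap_annotations(
    ["Name: Alice Date: 2022"], toy_generic_model, toy_review, toy_train
)
print(model("Name: Alice"))
```

In a real setting, `train` would fine-tune from the previous model's weights rather than ignore it, and `review` would be an annotation tool session rather than a rule; both are left abstract here because the abstract does not specify them.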