DoSA: A System to Accelerate Annotations on Business Documents with Human-in-the-Loop

Business documents come in a variety of structures and formats and serve different information needs, which makes information extraction a challenging task. Given this variation, a single document-generic model that works well across all document types and all use cases seems far-fetched. Document-specific models, in turn, require customized document-specific labels. We introduce DoSA (Document Specific Automated Annotations), which helps annotators generate initial annotations automatically using our novel bootstrap approach, which leverages document-generic datasets and models. A human can then review these initial annotations for correctness. An initial document-specific model can be trained, and its inference can be used as feedback to generate more automated annotations. These automated annotations are again reviewed by a human in the loop, and a new, improved model is trained using the current model as the pre-trained model before the next iteration. In this paper, our scope is limited to form-like documents because of the limited availability of generic annotated datasets, but the idea can be extended to other document types as more datasets are built. An open-source, ready-to-use implementation is available on GitHub: https://github.com/neeleshkshukla/DoSA.
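The iterative loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the DoSA implementation: the function name `bootstrap_annotations` and the four callables (`generic_predict`, `review`, `train`, `predict`) are hypothetical placeholders for the document-generic model, the human reviewer, the training step, and document-specific inference.

```python
from typing import Any, Callable, List, Optional, Sequence, Tuple


def bootstrap_annotations(
    batches: Sequence[Sequence[Any]],
    generic_predict: Callable[[Any], Any],
    review: Callable[[Any, Any], Any],
    train: Callable[[List[Tuple[Any, Any]], Optional[Any]], Any],
    predict: Callable[[Any, Any], Any],
) -> Tuple[Optional[Any], List[Tuple[Any, Any]]]:
    """Hypothetical sketch of a bootstrap loop with a human in the loop.

    All callables are assumptions made for illustration:
      generic_predict(doc)        -> draft annotation from a document-generic model
      review(doc, draft)          -> annotation after human correction
      train(labeled, prev_model)  -> new model, warm-started from prev_model
      predict(model, doc)         -> draft annotation from the document-specific model
    """
    labeled: List[Tuple[Any, Any]] = []
    model: Optional[Any] = None
    for batch in batches:
        for doc in batch:
            # First iteration: no document-specific model exists yet,
            # so drafts come from the generic model.
            if model is None:
                draft = generic_predict(doc)
            else:
                draft = predict(model, doc)
            # A human checks and corrects every automated draft.
            labeled.append((doc, review(doc, draft)))
        # Retrain on everything reviewed so far, using the current
        # model as the starting point for the next one.
        model = train(labeled, model)
    return model, labeled
```

Each pass through the outer loop corresponds to one iteration in the abstract: automated drafts, human review, then retraining from the current model. How documents are split into batches and when to stop iterating are left open here, since the abstract does not specify them.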

