We introduce Dessurt, a relatively simple document understanding transformer that can be fine-tuned on a greater variety of document tasks than prior methods. It receives a document image and a task string as input and autoregressively generates arbitrary text as output. Because Dessurt is an end-to-end architecture that performs text recognition in addition to document understanding, it does not require an external recognition model, as prior methods do. Dessurt is more flexible than prior methods and can handle a variety of document domains and tasks. We show that the model is effective on nine different dataset-task combinations.
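
The input/output contract described above (document image plus task string in, autoregressively generated text out) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the Dessurt code: the `generate` function, the `toy_next_token` stand-in, the `EOS` marker, and the task strings `"read"` and `"classify"` are all hypothetical. The toy "model" ignores the image and spells out a canned answer, so only the greedy decoding loop is meaningful.

```python
from typing import Callable, List

# Hypothetical end-of-sequence marker; not taken from the Dessurt code.
EOS = "<eos>"

# A "model" here is any callable mapping
# (image, task string, tokens generated so far) -> next token.
NextTokenFn = Callable[[list, str, List[str]], str]


def generate(next_token: NextTokenFn, image: list, task: str, max_len: int = 32) -> str:
    """Greedy autoregressive decoding: each emitted token is fed back in."""
    tokens: List[str] = []
    for _ in range(max_len):
        tok = next_token(image, task, tokens)
        if tok == EOS:
            break
        tokens.append(tok)
    return "".join(tokens)


def toy_next_token(image: list, task: str, tokens: List[str]) -> str:
    """Stand-in for a trained network (assumption): ignores the image and
    emits one character of a canned, task-dependent answer per step."""
    canned = {"read": "INVOICE", "classify": "form"}  # hypothetical task strings
    answer = canned.get(task, "")
    position = len(tokens)
    return answer[position] if position < len(answer) else EOS


# Placeholder 4x4 pixel grid standing in for a document image.
blank_image = [[0] * 4 for _ in range(4)]
result = generate(toy_next_token, blank_image, "read")
```

Because the task is passed as a plain string and the output is free text, the same decoding loop serves every task in this sketch; only the task string changes between calls.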