A real-world information extraction (IE) system for semi-structured document images often involves a long pipeline of multiple modules, whose complexity dramatically increases its development and maintenance cost. One can instead consider an end-to-end model that directly maps the input to the target output and simplifies the entire process. However, such a generation approach is known to lead to unstable performance if not designed carefully. Here we present our recent effort on transitioning from our existing pipeline-based IE system to an end-to-end system, focusing on the practical challenges associated with replacing and deploying the system in real, large-scale production. By carefully formulating document IE as a sequence generation task, we show that a single end-to-end IE system can be built and still achieve competent performance.