When extracting structured data from repetitively organized documents, such as dictionaries, directories, or even newspapers, a key challenge is to correctly segment the basic text regions that will populate the target database. Traditionally, this problem was tackled as part of layout analysis, mostly with dividing (top-down) approaches based on visual cues. Some agglomerating (bottom-up) approaches started to consider textual information to link similar contents, but they required a proper over-segmentation into fine-grained units. In this work, we propose a new pragmatic approach whose efficiency is demonstrated on 19th-century French Trade Directories. We propose to consider two sub-problems: coarse layout detection (text columns and reading order), which is assumed to be effective and is not detailed here, and a fine-grained entry separation stage, for which we propose to adapt a state-of-the-art Named Entity Recognition (NER) approach. By injecting special visual tokens, coding, for instance, indentation or breaks, into the token stream of the language model used for NER, we can leverage both textual and visual knowledge simultaneously. Code, data, results and models are available at https://github.com/soduco/paper-entryseg-icdar23-code and https://huggingface.co/HueyNemud/ (icdar23-entrydetector* variants).
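To make the idea of injecting visual tokens concrete, the following minimal sketch interleaves layout markers with the words of OCR text lines. It is an illustration under stated assumptions, not the implementation from the repository above: the marker strings `<indent>` and `<break>`, the `OcrLine` structure, the `indent_threshold` parameter, and the sample directory lines are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical marker strings; the actual special tokens may differ.
INDENT = "<indent>"
BREAK = "<break>"


@dataclass
class OcrLine:
    """One OCR text line within a column (hypothetical structure)."""
    text: str
    left_offset: float  # distance from the column's left edge, in pixels


def inject_visual_tokens(lines, indent_threshold=10.0):
    """Flatten text lines into one token stream carrying layout markers.

    An INDENT marker precedes the words of any line whose left offset
    exceeds indent_threshold, and a BREAK marker follows every line.
    """
    tokens = []
    for line in lines:
        if line.left_offset > indent_threshold:
            tokens.append(INDENT)
        tokens.extend(line.text.split())
        tokens.append(BREAK)
    return tokens


# Made-up sample lines: the second line is an indented continuation.
lines = [
    OcrLine("Dupont, épicier,", 0.0),
    OcrLine("rue de la Paix, 12.", 18.0),
    OcrLine("Durand, tailleur, r. Vivienne, 4.", 0.0),
]
stream = inject_visual_tokens(lines)
print(" ".join(stream))
```

A token-classification model that reads such a stream sees the indentation and line-break cues next to the words themselves, which is one plausible way for an NER-style tagger to learn where an entry starts and ends.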