In recent years, research on visual document understanding (VDU) has grown significantly, with a particular emphasis on the development of self-supervised learning methods. However, one of the major challenges in this field is the limited availability of publicly accessible visual corpora, i.e., large collections of images with detailed text annotations, particularly for non-Latin or resource-scarce languages. To address this challenge, we propose Web-based Visual Corpus Builder (Webvicob), a dataset generator engine capable of constructing large-scale, multilingual visual corpora from raw Wikipedia HTML dumps. Our experiments demonstrate that the data generated by Webvicob can be used to train robust VDU models that perform well on various downstream tasks, such as DocVQA and post-OCR parsing. Furthermore, when using a dataset of 1 million images generated by Webvicob, we observed an improvement of over 13% on DocVQA Task 3 compared to a dataset of 11 million images from IIT-CDIP. The implementation of our engine is publicly available at https://github.com/clovaai/webvicob