Document layout analysis is a well-known problem in the document research community and has been extensively explored, yielding a multitude of solutions ranging from text mining and recognition to graph-based representation, visual feature extraction, etc. However, most existing works ignore a crucial fact: the scarcity of labeled data. With growing internet connectivity in personal life, an enormous number of documents has become available in the public domain, making data annotation a tedious task. We address this challenge using self-supervision. Unlike the few existing self-supervised document segmentation approaches, which rely on text mining and textual labels, we use a purely vision-based approach for pre-training, without any ground-truth labels or their derivatives. Instead, we generate pseudo-layouts from document images to pre-train an image encoder to learn document object representation and localization in a self-supervised framework, before fine-tuning it with an object detection model. We show that our pipeline sets a new benchmark in this context and performs on par with, if not better than, existing methods and their supervised counterparts. The code is made publicly available at: https://github.com/MaitySubhajit/SelfDocSeg
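The pseudo-layout idea can be illustrated with a minimal sketch: treat dark pixels of a binarized page as "ink" and group them into connected regions, whose bounding boxes serve as pseudo object annotations. This is a simplified, hypothetical illustration assuming a precomputed binary mask; the paper's actual pseudo-layout generation operates on full document images and may differ in detail.

```python
# Hypothetical sketch of pseudo-layout generation from a binary page mask.
# This is NOT the paper's exact algorithm, only an illustration of the idea:
# connected ink regions -> bounding boxes used as pseudo layout labels.
from collections import deque


def pseudo_layout(mask):
    """Return bounding boxes (x0, y0, x1, y1) of 4-connected ink regions.

    `mask` is a 2D list of 0/1 values where 1 marks an "ink" pixel of the
    binarized document page.
    """
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if mask[y][x] and not seen[y][x]:
                # Grow one region with a breadth-first flood fill.
                q = deque([(y, x)])
                seen[y][x] = True
                x0 = x1 = x
                y0 = y1 = y
                while q:
                    cy, cx = q.popleft()
                    x0, x1 = min(x0, cx), max(x1, cx)
                    y0, y1 = min(y0, cy), max(y1, cy)
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            q.append((ny, nx))
                boxes.append((x0, y0, x1, y1))
    return boxes
```

In practice such boxes would be obtained from the document image itself (e.g. via thresholding and morphological dilation to merge nearby glyphs into text blocks) and then used as targets to pre-train the image encoder without any human annotation.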