Document layout analysis is a well-known problem in the document research community and has been explored extensively, yielding a multitude of solutions ranging from text mining and recognition to graph-based representations, visual feature extraction, \emph{etc.} However, most existing works ignore a crucial fact: the scarcity of labeled data. With growing internet connectivity in personal life, an enormous number of documents have become available in the public domain, making data annotation a tedious task. We address this challenge using self-supervision: unlike the few existing self-supervised document segmentation approaches, which rely on text mining and textual labels, we use a purely vision-based approach in pre-training, without any ground-truth label or its derivative. Instead, we generate pseudo-layouts from the document images to pre-train an image encoder to learn document object representations and localization in a self-supervised framework, before fine-tuning it with an object detection model. We show that our pipeline sets a new benchmark in this context and performs at par with, if not outperforms, the existing methods and their supervised counterparts. The code is publicly available at \href{https://github.com/MaitySubhajit/SelfDocSeg}{github.com/MaitySubhajit/SelfDocSeg}.
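The abstract mentions generating pseudo-layouts from document images without ground-truth labels. The paper's actual generation pipeline is not specified here; the following is a minimal illustrative sketch of one plausible vision-only approach, assuming a grayscale page where dark pixels are content: binarize the image, dilate the foreground so nearby glyphs merge into region blobs, and extract bounding boxes of connected components as pseudo-layout annotations. All function names and parameters are hypothetical, written with NumPy only.

```python
import numpy as np

def binarize(img, thresh=128):
    """Foreground mask: dark pixels (ink) on a light page. (Illustrative.)"""
    return (img < thresh).astype(np.uint8)

def dilate(mask, iters=1):
    """Naive binary dilation with a 3x3 cross structuring element,
    merging nearby glyphs into blob-like layout regions."""
    m = mask.copy()
    for _ in range(iters):
        p = np.pad(m, 1)  # zero-pad so shifts stay in bounds
        m = (p[1:-1, 1:-1] | p[:-2, 1:-1] | p[2:, 1:-1]
             | p[1:-1, :-2] | p[1:-1, 2:]).astype(np.uint8)
    return m

def connected_boxes(mask):
    """4-connected components via iterative flood fill; each component
    yields one pseudo-layout box as (x0, y0, x1, y1), inclusive."""
    h, w = mask.shape
    seen = np.zeros((h, w), dtype=bool)
    boxes = []
    for i in range(h):
        for j in range(w):
            if mask[i, j] and not seen[i, j]:
                stack = [(i, j)]
                seen[i, j] = True
                x0 = x1 = j
                y0 = y1 = i
                while stack:
                    y, x = stack.pop()
                    x0, x1 = min(x0, x), max(x1, x)
                    y0, y1 = min(y0, y), max(y1, y)
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w \
                                and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            stack.append((ny, nx))
                boxes.append((x0, y0, x1, y1))
    return boxes
```

In a self-supervised setup, boxes produced this way could serve as free localization targets for pre-training the image encoder; no textual annotation is involved at any step.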