Nowadays, as cameras are rapidly adopted in our daily routine, images of documents are becoming both abundant and prevalent. Unlike natural images that capture physical objects, document-images contain a significant amount of text with critical semantics and complicated layouts. In this work, we devise a generic unsupervised technique to learn multimodal affinities between textual entities in a document-image, considering their visual style, the content of their underlying text, and their geometric context within the image. We then use these learned affinities to automatically cluster the textual entities in the image into different semantic groups. The core of our approach is a deep optimization scheme, dedicated to the user-provided image, that detects and leverages reliable pairwise connections in the multimodal representation of the textual elements in order to properly learn the affinities. We show that our technique can operate on highly varying images spanning a wide range of documents and demonstrate its applicability for various editing operations manipulating the content, appearance, and geometry of the image.
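To make the described pipeline concrete, below is a minimal sketch of its high-level structure: per-entity features from the three modalities (visual style, text content, geometry) are fused into a pairwise affinity matrix, which is then clustered into semantic groups. This is not the authors' actual method; the feature arrays are hypothetical stand-ins for learned embeddings, and the paper's per-image deep optimization over reliable pairwise connections is replaced here by a simple weighted cosine fusion followed by off-the-shelf spectral clustering, purely for illustration.

```python
# Illustrative sketch only: fuses hypothetical multimodal features of
# textual entities into an affinity matrix and clusters them.
# The paper's per-image deep optimization is swapped for spectral
# clustering on a hand-built affinity matrix.
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.preprocessing import normalize


def multimodal_affinities(style, text, geom, weights=(1.0, 1.0, 1.0)):
    """Fuse per-modality cosine affinities into one matrix.

    style, text, geom: (n_entities, d_*) arrays, one row per textual
    entity (e.g. a word box) in the document-image.
    """
    mats = []
    for feats, w in zip((style, text, geom), weights):
        f = normalize(feats)               # unit-norm rows -> cosine similarity
        sim = np.clip(f @ f.T, 0.0, 1.0)   # keep affinities non-negative
        mats.append(w * sim)
    return sum(mats) / sum(weights)


def cluster_entities(affinity, n_groups):
    """Assign each textual entity to a semantic group."""
    model = SpectralClustering(
        n_clusters=n_groups, affinity="precomputed", random_state=0
    )
    return model.fit_predict(affinity)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 12
    style = rng.normal(size=(n, 16))   # hypothetical visual-style embeddings
    text = rng.normal(size=(n, 32))    # hypothetical text-content embeddings
    geom = rng.normal(size=(n, 4))     # hypothetical geometric features
    A = multimodal_affinities(style, text, geom)
    print(cluster_entities(A, n_groups=3))
```

The resulting group labels could then drive the editing operations the abstract mentions, e.g. restyling or moving all entities in one semantic group at once.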