Existing Scene Text Recognition (STR) methods typically use a language model to optimize the joint probability of the 1D character sequence predicted by a visual recognition (VR) model, ignoring the 2D spatial context of visual semantics within and between character instances, which makes them generalize poorly to arbitrarily shaped scene text. To address this issue, we make the first attempt in this paper to perform textual reasoning based on visual semantics. Technically, given the character segmentation maps predicted by a VR model, we construct a subgraph for each instance, where nodes represent the pixels in it and edges are added between nodes based on their spatial similarity. These subgraphs are then sequentially connected by their root nodes and merged into a complete graph. Based on this graph, we devise a graph convolutional network for textual reasoning (GTR), supervised with a cross-entropy loss. GTR can be easily plugged into representative STR models to improve their performance owing to better textual reasoning. Specifically, we construct our model, namely S-GTR, by paralleling GTR with the language model in a segmentation-based STR baseline, which can effectively exploit the visual-linguistic complementarity via mutual learning. S-GTR sets a new state of the art on six challenging STR benchmarks and generalizes well to multilingual datasets. Code is available at https://github.com/adeline-cs/GTR.
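To make the graph construction and reasoning pipeline concrete, the following is a minimal sketch of the idea described above, not the authors' implementation (see the linked repository for that). It assumes binary per-character segmentation masks in reading order, uses the pixel nearest each instance's centroid as a hypothetical root node, a simple distance threshold as the spatial-similarity criterion, and a two-layer symmetric-normalized GCN; all names and thresholds are illustrative.

```python
# Illustrative sketch only: build a text graph from character segmentation
# maps and run a small GCN over it. Design choices (centroid roots,
# dist_thresh, two layers) are assumptions, not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

def build_text_graph(seg_maps, dist_thresh=1.5):
    """seg_maps: (K, H, W) binary masks, one per character instance,
    ordered in reading order. Nodes are foreground pixels; intra-instance
    edges link spatially close pixels, and consecutive instances are
    chained through their root nodes to form the complete graph."""
    coords, roots, offset = [], [], 0
    for mask in seg_maps:
        ys, xs = torch.nonzero(mask, as_tuple=True)
        pts = torch.stack([ys.float(), xs.float()], dim=1)   # (n_k, 2)
        coords.append(pts)
        # hypothetical root: the pixel closest to the instance centroid
        centroid = pts.mean(dim=0, keepdim=True)
        roots.append(offset + torch.cdist(pts, centroid).argmin().item())
        offset += pts.shape[0]
    all_pts = torch.cat(coords, dim=0)                       # (N, 2)
    n = all_pts.shape[0]
    adj = torch.zeros(n, n)
    # intra-instance edges: connect pixels within a small spatial radius
    # (d == 0 on the diagonal also gives each node a self-loop)
    start = 0
    for pts in coords:
        d = torch.cdist(pts, pts)
        adj[start:start + pts.shape[0],
            start:start + pts.shape[0]] = (d <= dist_thresh).float()
        start += pts.shape[0]
    # inter-instance edges: chain consecutive root nodes in reading order
    for a, b in zip(roots[:-1], roots[1:]):
        adj[a, b] = adj[b, a] = 1.0
    return all_pts, adj

class GCNLayer(nn.Module):
    """One graph convolution: H' = relu(D^-1/2 A D^-1/2 H W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        deg = adj.sum(dim=1).clamp(min=1.0)
        norm = deg.rsqrt()
        a_hat = norm[:, None] * adj * norm[None, :]
        return F.relu(a_hat @ self.lin(x))

class TextualReasoner(nn.Module):
    """Two-layer GCN producing per-node character logits; training with
    F.cross_entropy on these logits matches the supervision named above."""
    def __init__(self, feat_dim, hidden, num_classes):
        super().__init__()
        self.gcn1 = GCNLayer(feat_dim, hidden)
        self.gcn2 = GCNLayer(hidden, num_classes)

    def forward(self, x, adj):
        return self.gcn2(self.gcn1(x, adj), adj)  # (N, num_classes)
```

In practice the node features `x` would come from the VR model's feature maps sampled at each pixel location rather than raw coordinates; the sketch only fixes the graph topology and the cross-entropy-supervised GCN stated in the abstract.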