Scene text editing (STE), which converts the text in a scene image into a desired text while preserving the original style, is a challenging task due to the complex interplay between text and style. To address this challenge, we propose a novel representation learning-based STE model, referred to as RewriteNet, which employs textual information as well as visual information. We assume that a scene text image can be decomposed into content and style features, where the former represents the text information and the latter represents scene text characteristics such as font, alignment, and background. Under this assumption, we propose a method to separately encode the content and style features of the input image by introducing a scene text recognizer trained with text information. A text-edited image is then generated by combining the style feature from the original image with the content feature from the target text. Unlike previous works that can only use synthetic images in the training phase, we also exploit real-world images by proposing a self-supervised training scheme that bridges the domain gap between synthetic and real data. Our experiments demonstrate that RewriteNet achieves better quantitative and qualitative performance than competing methods. Moreover, we validate that the use of textual information and the self-supervised training scheme improves text-switching performance. The implementation and dataset will be made publicly available.
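To make the described generation step concrete, below is a minimal, hypothetical PyTorch sketch of the encode-then-combine pipeline suggested by the abstract. The module names (StyleEncoder, ContentEncoder, Generator), the layer choices, and the assumption that the target text is first rendered onto a plain template image are our illustrative assumptions, not the authors' exact implementation.

```python
# Hypothetical sketch of a RewriteNet-style editing step: encode style and content
# separately, then decode a text-edited image from the combined features.
import torch
import torch.nn as nn


class StyleEncoder(nn.Module):
    """Encodes scene-text characteristics such as font, alignment, and background."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, img):
        return self.net(img)  # style feature map


class ContentEncoder(nn.Module):
    """Encodes textual content; a recognizer head supervises the feature so that it
    carries text information (assumed supervision, e.g. a CTC-style loss)."""
    def __init__(self, dim=256, num_chars=97):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.recognizer_head = nn.Linear(dim, num_chars)

    def forward(self, img):
        feat = self.net(img)                      # content feature map
        seq = feat.mean(dim=2).permute(0, 2, 1)   # collapse height -> (B, W, dim)
        logits = self.recognizer_head(seq)        # per-step character logits
        return feat, logits


class Generator(nn.Module):
    """Decodes a text-edited image from concatenated style and content features."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(2 * dim, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, style_feat, content_feat):
        return self.net(torch.cat([style_feat, content_feat], dim=1))


# Editing step: style comes from the original scene image, content from an image
# rendering of the target text (the rendering pipeline itself is not shown here).
style_enc, content_enc, gen = StyleEncoder(), ContentEncoder(), Generator()
scene_img = torch.randn(1, 3, 64, 256)        # original scene text image
target_text_img = torch.randn(1, 3, 64, 256)  # plain rendering of the target text
style_feat = style_enc(scene_img)
content_feat, _ = content_enc(target_text_img)
edited_img = gen(style_feat, content_feat)    # (1, 3, 64, 256) text-edited image
```

In this sketch the recognizer head is what disentangles the two branches: only the content encoder receives text supervision, so the style encoder is free to capture everything else about the scene.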