Scene text editing (STE), which converts the text in a scene image into a desired text while preserving the original style, is a challenging task due to the complex interaction between text content and style. In this paper, we propose a novel STE model, referred to as RewriteNet, that decomposes a text image into content and style features and rewrites the text in the original image. Specifically, RewriteNet implicitly separates content from style by introducing a scene text recognition branch. In addition to the exact supervision available for synthetic examples, we propose a self-supervised training scheme for unlabeled real-world images, which bridges the domain gap between synthetic and real data. Our experiments show that RewriteNet achieves better generation quality than competing methods. Further analyses verify the feature decomposition of RewriteNet and demonstrate its reliability and robustness across diverse experiments. Our implementation is publicly available at \url{https://github.com/clovaai/rewritenet}
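To make the described pipeline concrete, below is a minimal PyTorch-style sketch of the two-encoder decomposition, the recognition head, and the self-supervised reconstruction step on unlabeled real images. This is an illustrative toy under stated assumptions, not the paper's actual architecture: all module names (\texttt{RewriteNetSketch}, \texttt{make\_encoder}), layer choices, and shapes are hypothetical.

\begin{verbatim}
# Hypothetical sketch of the idea in the abstract: decompose an image into
# style and content features, supervise the content branch with a text
# recognizer, and decode (style, target content) into an edited image.
# All layers, names, and shapes are illustrative toys.
import torch
import torch.nn as nn

def make_encoder(feat_dim):
    # Toy CNN stand-in for the real style/content encoders.
    return nn.Sequential(
        nn.Conv2d(3, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten())

class RewriteNetSketch(nn.Module):
    def __init__(self, feat_dim=256, vocab_size=97):
        super().__init__()
        self.style_enc = make_encoder(feat_dim)
        self.content_enc = make_encoder(feat_dim)
        # Recognition head: its loss pushes the content branch to carry
        # text information only, implicitly leaving style to the other branch.
        self.recognizer = nn.Linear(feat_dim, vocab_size)
        self.decoder = nn.Sequential(
            nn.Linear(2 * feat_dim, 3 * 32 * 32), nn.Tanh())

    def forward(self, style_img, content_img):
        style = self.style_enc(style_img)        # style of the original scene
        content = self.content_enc(content_img)  # rendering of the target text
        logits = self.recognizer(content)
        out = self.decoder(torch.cat([style, content], dim=1))
        return out.view(-1, 3, 32, 32), logits

# Self-supervised step on unlabeled real images (as the abstract describes):
# the same real image serves as both style source and reconstruction target.
model = RewriteNetSketch()
real = torch.rand(4, 3, 32, 32)
recon, _ = model(real, real)
loss = nn.functional.l1_loss(recon, real)
\end{verbatim}

In this sketch the recognition loss on \texttt{logits} is what implicitly enforces the decomposition, and the reconstruction loss is what lets unlabeled real images participate in training.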