Scene text removal (STR), the task of erasing text from natural scene images, has recently attracted attention as an important component of text editing and of concealing private information such as IDs, telephone numbers, and license plate numbers. Although a variety of STR methods are being actively researched, it is difficult to compare them fairly because previously proposed methods were not trained and evaluated on the same standardized datasets. We re-implement several previous methods in a standardized manner and evaluate their performance on the same standardized training/testing dataset. We also introduce a simple yet highly effective Gated Attention (GA) mechanism and a Region-of-Interest Generation (RoIG) methodology. GA uses attention to focus on the text strokes as well as the textures and colors of the surrounding regions, removing text from the input image much more precisely. RoIG focuses training on only the regions that contain text rather than the entire image, making training more efficient. Experimental results on the benchmark dataset show that our method significantly outperforms existing state-of-the-art methods on almost all metrics, with remarkably higher-quality results. Furthermore, because our model does not generate an explicit text stroke mask, it requires no additional refinement steps or sub-models, making it extremely fast with fewer parameters. The dataset and code are available at https://github.com/naver/garnet.
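The abstract describes GA and RoIG only at a high level. The following is a minimal illustrative sketch of the two underlying ideas, not the authors' implementation: a sigmoid gate that re-weights spatial features (so the network can emphasize text strokes and nearby texture/color context), and a loss computed only inside a text-region mask (so training focuses on text areas rather than the whole image). All function names, shapes, and the L1 loss choice here are assumptions for illustration.

```python
import numpy as np

def gated_attention(features, gate_logits):
    """Sketch of a gated-attention idea (assumed form, not the paper's exact
    module): a sigmoid gate in [0, 1] re-weights each feature element so the
    model can emphasize text strokes and surrounding texture/color context."""
    gate = 1.0 / (1.0 + np.exp(-gate_logits))  # element-wise sigmoid gate
    return features * gate                      # gated feature map

def roi_masked_loss(pred, target, text_mask):
    """Sketch of RoI-restricted training (assumed L1 form): the reconstruction
    loss is averaged only over pixels inside the text-region mask, instead of
    over the entire image."""
    diff = np.abs(pred - target) * text_mask    # zero out non-text regions
    return diff.sum() / max(text_mask.sum(), 1.0)
```

For example, with a gate logit of 0 the gate value is 0.5, and with a mask covering one pixel the loss reduces to the absolute error at that pixel alone, so background regions contribute nothing to the gradient.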