Scene-text image synthesis techniques, which aim to naturally compose text instances on background scene images, are very appealing for training deep neural networks because they can provide accurate and comprehensive annotations. Prior studies have explored generating synthetic text images on two-dimensional and three-dimensional surfaces based on rules derived from real-world observations. Some of these studies have proposed generating scene-text images through learning; however, owing to the absence of a suitable training dataset, unsupervised frameworks have been explored that learn from existing real-world data, which may not result in robust performance. To ease this dilemma and facilitate research on learning-based scene text synthesis, we propose DecompST, a real-world dataset prepared from public benchmarks, with three types of annotations: quadrilateral-level BBoxes, stroke-level text masks, and text-erased images. Using the DecompST dataset, we propose an image synthesis engine that includes a text location proposal network (TLPNet) and a text appearance adaptation network (TAANet). TLPNet first predicts regions suitable for text embedding; TAANet then adaptively changes the geometry and color of the text instance according to the context of the background. Our comprehensive experiments verified the effectiveness of the proposed method for generating pretraining data for scene text detectors.
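The abstract names only the two components and their roles, so the following is a minimal PyTorch sketch of that two-stage flow under stated assumptions: the `TLPNet` and `TAANet` module bodies, the affine-plus-color parameterization of "appearance adaptation", and the alpha-blend composition are all illustrative placeholders, not the paper's actual architectures.

```python
# Hypothetical sketch of the two-stage pipeline described in the abstract.
# Module internals are placeholder assumptions, not the authors' design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TLPNet(nn.Module):
    """Text location proposal network: predicts a per-pixel suitability
    map over the background for text embedding (placeholder head)."""
    def __init__(self):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 1),
        )

    def forward(self, bg):                       # bg: (B, 3, H, W)
        return torch.sigmoid(self.head(bg))      # (B, 1, H, W) suitability

class TAANet(nn.Module):
    """Text appearance adaptation network: regresses a geometry (2x3
    affine) and an RGB color for one text instance from the background
    context (placeholder parameterization)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.geometry = nn.Linear(32, 6)
        self.color = nn.Linear(32, 3)
        # Initialize to the identity transform, a common trick for
        # spatial-transformer-style heads.
        self.geometry.weight.data.zero_()
        self.geometry.bias.data.copy_(torch.tensor([1., 0, 0, 0, 1, 0]))

    def forward(self, bg, text_mask):            # text_mask: (B, 1, H, W)
        feat = self.encoder(torch.cat([bg, text_mask], dim=1))
        theta = self.geometry(feat).view(-1, 2, 3)
        color = torch.sigmoid(self.color(feat))  # (B, 3) in [0, 1]
        return theta, color

def synthesize(bg, text_mask, tlp, taa):
    """Compose one text instance onto a background: propose locations,
    adapt geometry/color, then alpha-blend. Selecting a placement from
    the suitability map is omitted here for brevity."""
    suitability = tlp(bg)
    theta, color = taa(bg, text_mask)
    grid = F.affine_grid(theta, text_mask.shape, align_corners=False)
    warped = F.grid_sample(text_mask, grid, align_corners=False)
    fg = color.view(-1, 3, 1, 1) * warped
    return bg * (1 - warped) + fg, suitability

# Toy usage on random tensors, just to exercise shapes.
bg = torch.rand(1, 3, 128, 128)
mask = torch.rand(1, 1, 128, 128)
composed, suit = synthesize(bg, mask, TLPNet(), TAANet())
print(composed.shape, suit.shape)  # (1, 3, 128, 128) (1, 1, 128, 128)
```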