Large numbers of annotated training images are critical for training accurate and robust deep network models, but collecting them is often time-consuming and costly. Image synthesis alleviates this constraint by generating annotated training images automatically, and it has attracted increasing interest in recent deep learning research. We develop an innovative image synthesis technique that composes annotated training images by realistically embedding foreground objects of interest (OOI) into background images. The proposed technique consists of two key components that, in principle, boost the usefulness of the synthesized images in deep network training. The first is context-aware semantic coherence, which ensures that the OOI are placed around semantically coherent regions within the background image. The second is harmonious appearance adaptation, which ensures that the embedded OOI agree with the surrounding background in both geometric alignment and appearance realism. The proposed technique has been evaluated on two related but very different computer vision challenges, namely scene text detection and scene text recognition. Experiments on a number of public datasets demonstrate the effectiveness of the proposed technique: deep networks trained on our synthesized images achieve scene text detection and recognition performance similar to, or even better than, networks trained on real images.
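To make the two components concrete, the following is a minimal sketch of the general idea, not the authors' implementation. It assumes a binary `region_mask` that marks semantically suitable background pixels is already available (the abstract does not say how such regions are obtained), and it substitutes a simple colour-mean shift for the appearance adaptation step. All function names, parameters, and the `strength` value are hypothetical.

```python
import numpy as np


def find_placements(region_mask, h, w):
    """Return top-left corners (y, x) where an h-by-w box lies fully inside region_mask."""
    # An integral image gives the in-region pixel count of any box in O(1).
    integral = np.pad(region_mask.astype(np.int64), ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    counts = (integral[h:, w:] - integral[:-h, w:]
              - integral[h:, :-w] + integral[:-h, :-w])
    ys, xs = np.nonzero(counts == h * w)
    return list(zip(ys.tolist(), xs.tolist()))


def adapt_appearance(fg, alpha, bg_patch, strength=0.5):
    """Shift the foreground colour mean toward the local background mean.

    Assumption: this is a crude stand-in for appearance adaptation,
    not the method proposed in the paper.
    """
    fg = fg.astype(np.float64)
    weights = alpha[..., None]
    fg_mean = (fg * weights).sum((0, 1)) / max(alpha.sum(), 1e-8)
    bg_mean = bg_patch.reshape(-1, bg_patch.shape[-1]).mean(0)
    return np.clip(fg + strength * (bg_mean - fg_mean), 0, 255)


def embed(background, region_mask, fg, alpha, rng, strength=0.5):
    """Embed fg into background inside region_mask; return (image, bbox) or None."""
    h, w = alpha.shape
    candidates = find_placements(region_mask, h, w)
    if not candidates:
        return None  # no semantically suitable location is large enough
    y, x = candidates[rng.integers(len(candidates))]
    out = background.astype(np.float64)
    patch = out[y:y + h, x:x + w].copy()
    fg_adapted = adapt_appearance(fg, alpha, patch, strength)
    a = alpha[..., None]
    # Alpha-blend the adapted foreground over the background patch.
    out[y:y + h, x:x + w] = a * fg_adapted + (1.0 - a) * patch
    # The bounding box (x, y, width, height) is the free annotation.
    return out.astype(np.uint8), (x, y, w, h)
```

Calling `embed(background, region_mask, fg, alpha, np.random.default_rng(0))` returns the composed image together with its bounding box, which is how a synthesis pipeline of this kind obtains annotations at no labelling cost. Geometric alignment, such as warping the object to the surface orientation, is deliberately left out of the sketch.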