We propose a new paradigm for automatically generating training data with accurate labels at scale using text-to-image synthesis frameworks (e.g., DALL-E, Stable Diffusion). The proposed approach decouples training data generation into foreground object mask generation and background (context) image generation. For foreground object mask generation, we use a simple textual template containing the object class name as input to DALL-E to generate a diverse set of foreground images. A foreground-background segmentation algorithm is then applied to these images to produce foreground object masks. Next, to generate context images, a language description of the context is first obtained by applying an image captioning method to a small set of images representing the context. These language descriptions are then used to generate diverse sets of context images with the DALL-E framework. Finally, the context images are composited with the object masks generated in the first step to provide an augmented training set for a classifier. We demonstrate the advantages of our approach on four object detection datasets, including the Pascal VOC and COCO object detection tasks. Furthermore, we highlight the compositional nature of our data generation approach in out-of-distribution and zero-shot data generation scenarios.
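The pipeline can be summarized in a short sketch. The code below is a minimal illustration, not the paper's implementation: it substitutes Stable Diffusion (via Hugging Face diffusers) for DALL-E, BLIP for the image captioning method, and rembg for the foreground-background segmentation algorithm. The model names, the compositing heuristic, and the bounding-box bookkeeping are all assumptions made for illustration.

```python
# Minimal sketch of the compositional data-generation pipeline.
# Assumptions: Stable Diffusion stands in for DALL-E, BLIP for the
# captioner, rembg for foreground-background segmentation.
import random

import torch
from PIL import Image
from diffusers import StableDiffusionPipeline
from transformers import BlipProcessor, BlipForConditionalGeneration
from rembg import remove  # off-the-shelf foreground-background separation

device = "cuda" if torch.cuda.is_available() else "cpu"
sd = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5").to(device)

# Step 1: foreground generation from a simple textual template.
def generate_foreground(class_name: str) -> Image.Image:
    prompt = f"a photo of a {class_name}"          # textual template
    return sd(prompt).images[0]

# Step 2: foreground object mask via foreground-background segmentation.
def extract_foreground(img: Image.Image) -> Image.Image:
    return remove(img)                             # RGBA image with alpha mask

# Step 3: caption a few real context images to get language descriptions.
processor = BlipProcessor.from_pretrained(
    "Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base").to(device)

def describe_context(context_img: Image.Image) -> str:
    inputs = processor(context_img, return_tensors="pt").to(device)
    out = captioner.generate(**inputs, max_new_tokens=30)
    return processor.decode(out[0], skip_special_tokens=True)

# Step 4: generate diverse context images from those descriptions.
def generate_context(description: str) -> Image.Image:
    return sd(description).images[0]

# Step 5: composite a masked foreground onto a generated context,
# recording the paste box as a bounding-box label.
def composite(fg_rgba: Image.Image, bg: Image.Image):
    scale = random.uniform(0.3, 0.6)               # illustrative size range
    w, h = int(bg.width * scale), int(bg.height * scale)
    fg = fg_rgba.resize((w, h))
    x = random.randint(0, bg.width - w)
    y = random.randint(0, bg.height - h)
    out = bg.copy()
    out.paste(fg, (x, y), fg)                      # alpha-aware paste
    return out, (x, y, x + w, y + h)               # image + box label
```

Because the compositing step controls where each foreground is pasted, the box coordinates come for free, which is what allows the synthesized images to carry accurate labels without manual annotation.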