Generating images from conditional descriptions has attracted increasing interest in recent years. However, existing conditional inputs suffer from either an unstructured form (captions) or limited information and expensive labeling (scene graphs). For a targeted scene, the core items, the objects, are usually definite, while their interactions are flexible and hard to define clearly. We therefore introduce a more rational setting: generating a realistic image from objects and captions. Under this setting, the objects explicitly define the critical roles in the targeted image, while the captions implicitly describe their rich attributes and connections. Correspondingly, we propose MOC-GAN, which mixes the inputs of the two modalities to generate realistic images. It first infers the implicit relations between object pairs from the captions to build a hidden-state scene graph, yielding a multi-layer representation containing objects, relations, and captions, where the scene graph provides the structure of the scene and the caption provides image-level guidance. A cascaded attentive generative network is then designed to generate each phrase patch in a coarse-to-fine manner by attending to the most relevant words in the caption. In addition, a phrase-wise DAMSM is proposed to better supervise fine-grained phrase-patch consistency. On the COCO dataset, our method outperforms state-of-the-art methods on both Inception Score and FID while maintaining high visual quality. Extensive experiments demonstrate the unique features of the proposed method.
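The word-level attention used when generating a phrase patch can be sketched roughly as follows. This is a minimal illustrative example, not the paper's implementation: the function name, feature shapes, and the plain dot-product scoring are all assumptions, showing only the general pattern of computing, for each image patch, a softmax-weighted context vector over the caption's word embeddings.

```python
import numpy as np

def word_attention(patch_feats, word_embs):
    """Attend over caption words for each image patch (illustrative sketch).

    patch_feats: (N, D) array of N patch features (hypothetical shapes).
    word_embs:   (T, D) array of embeddings for the T caption words.
    Returns an (N, D) array of word-context vectors, one per patch.
    """
    scores = patch_feats @ word_embs.T            # (N, T) relevance of each word to each patch
    scores -= scores.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)       # softmax over the caption words
    return attn @ word_embs                       # weighted sum of word embeddings
```

In a generator, such context vectors would be concatenated with the patch features at each refinement stage, so that finer stages can draw on the words most relevant to the region being synthesized.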