Stylized image captioning, as presented in prior work, aims to generate captions that reflect characteristics beyond a factual description of the scene composition, such as sentiment. Such prior work relies on given sentiment identifiers that express a certain global style in the caption, e.g., positive or negative, without taking the stylistic content of the visual scene into account. To address this shortcoming, we first analyze the limitations of current stylized captioning datasets and propose COCO attribute-based augmentations to obtain varied stylized captions from COCO annotations. Furthermore, we encode the style information in the latent space of a Variational Autoencoder; specifically, we leverage extracted image attributes to explicitly structure its sequential latent space according to different localized style characteristics. Our experiments on the Senticap and COCO datasets demonstrate that our approach generates accurate captions whose stylistic diversity is grounded in the image.
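To make the core idea concrete, below is a minimal sketch (in PyTorch) of a sequential VAE whose per-step latent variables are conditioned on localized image-attribute embeddings. All class and layer names are hypothetical illustrations under assumptions; the abstract does not specify the authors' exact architecture.

```python
import torch
import torch.nn as nn

class AttributeConditionedSeqVAE(nn.Module):
    """Sketch of a sequential VAE whose per-step latents are structured
    by attribute embeddings. Names and dimensions are assumptions, not
    the paper's actual model."""

    def __init__(self, vocab_size, n_attributes, emb_dim=256, hid_dim=512, z_dim=128):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_dim)
        self.attr_emb = nn.Embedding(n_attributes, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        # Per-step posterior parameters, conditioned on the attribute
        # embedding so the latent space is organized by local style.
        self.to_mu = nn.Linear(hid_dim + emb_dim, z_dim)
        self.to_logvar = nn.Linear(hid_dim + emb_dim, z_dim)
        self.decoder = nn.GRU(emb_dim + z_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, captions, attributes):
        # captions: (B, T) token ids; attributes: (B, T) attribute ids,
        # aligning one localized attribute with each caption position.
        w = self.word_emb(captions)            # (B, T, E)
        a = self.attr_emb(attributes)          # (B, T, E)
        h, _ = self.encoder(w)                 # (B, T, H)
        ha = torch.cat([h, a], dim=-1)
        mu, logvar = self.to_mu(ha), self.to_logvar(ha)
        # Reparameterization trick: sample a latent per time step.
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        d, _ = self.decoder(torch.cat([w, z], dim=-1))
        logits = self.out(d)                   # (B, T, V)
        # Per-step KL term regularizes the sequential latent space.
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return logits, kl
```

At inference time, one could sample different per-step latents for the same image to obtain captions with varied localized styles; this usage follows from the abstract's claim of style diversity grounded in the image, not from a specified decoding procedure.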