Image captioning has an increasingly broad range of application domains, and fashion is no exception. Automatic item descriptions are of great interest to fashion web platforms, which sometimes host hundreds of thousands of images. This paper is among the first to tackle image captioning for fashion images. To help address dataset diversity issues, we introduced the InFashAIv1 dataset, containing almost 16,000 African fashion item images together with their titles, prices, and general descriptions. We also used the well-known DeepFashion dataset in addition to InFashAIv1. Captions are generated with the Show and Tell model, composed of a CNN encoder and an RNN decoder. We showed that jointly training the model on both datasets improves caption quality for African-style fashion images, suggesting transfer learning from Western-style data. The InFashAIv1 dataset is released on GitHub to encourage work with more diversity inclusion.
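For concreteness, the sketch below shows the general shape of a Show-and-Tell-style captioner: a CNN encoder producing an image feature that seeds an RNN decoder over caption tokens. This is a minimal PyTorch sketch under stated assumptions, not the authors' exact configuration; the ResNet-50 backbone, embedding and hidden sizes, and class names are illustrative choices.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class ShowAndTellSketch(nn.Module):
    """Minimal Show-and-Tell-style captioner: CNN encoder + RNN decoder.

    Illustrative only; backbone and sizes are assumptions, not the
    paper's reported setup.
    """
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # CNN encoder: pretrained ResNet-50 with the classifier head removed,
        # followed by a linear projection into the word-embedding space.
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.encoder = nn.Sequential(*list(resnet.children())[:-1])
        self.project = nn.Linear(resnet.fc.in_features, embed_dim)
        # RNN decoder: an LSTM that receives the image feature as its first
        # input step, then the embedded caption tokens.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feats = self.project(self.encoder(images).flatten(1))  # (B, embed_dim)
        tokens = self.embed(captions)                          # (B, T, embed_dim)
        # Prepend the image feature as the first "token" of the sequence.
        inputs = torch.cat([feats.unsqueeze(1), tokens], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.fc(hidden)  # per-step vocabulary logits
```

Under this formulation, joint training on both datasets amounts to drawing mini-batches from the union of DeepFashion and InFashAIv1 image-caption pairs and optimizing the usual per-token cross-entropy loss.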