Fashion illustration is used by designers to communicate their vision and to bring the design idea from conceptualization to realization, showing how clothes interact with the human body. In this context, computer vision can be leveraged to improve the fashion design process. Differently from previous works that mainly focused on the virtual try-on of garments, we propose the task of multimodal-conditioned fashion image editing, guiding the generation of human-centric fashion images by following multimodal prompts, such as text, human body poses, and garment sketches. We tackle this problem by proposing a new architecture based on latent diffusion models, an approach that has not been used before in the fashion domain. Given the lack of datasets suitable for the task, we also extend two existing fashion datasets, namely Dress Code and VITON-HD, with multimodal annotations collected in a semi-automatic manner. Experimental results on these new datasets demonstrate the effectiveness of our proposal, both in terms of realism and coherence with the given multimodal inputs. Source code and collected multimodal annotations will be publicly released at: https://github.com/aimagelab/multimodal-garment-designer.
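As a rough illustration of how such multimodal conditions can enter a latent diffusion denoiser, the toy sketch below concatenates pose and sketch feature maps with the noisy latent along the channel dimension and injects text embeddings through cross-attention. It is a minimal, self-contained example under assumed shapes and module names; none of the class names, tensor sizes, or hyper-parameters are taken from the released implementation.

```python
# Hypothetical sketch of multimodal conditioning in a latent diffusion denoiser.
# Spatial conditions (pose, sketch) are channel-concatenated with the noisy latent;
# text conditioning enters via cross-attention. Not the authors' architecture.
import torch
import torch.nn as nn

class ToyMultimodalDenoiser(nn.Module):
    def __init__(self, latent_ch=4, cond_ch=4, text_dim=768, hidden=64):
        super().__init__()
        # Pose heatmaps and garment sketch maps are stacked with the noisy latent.
        self.in_conv = nn.Conv2d(latent_ch + cond_ch, hidden, 3, padding=1)
        # Text tokens condition the spatial features through cross-attention.
        self.cross_attn = nn.MultiheadAttention(hidden, num_heads=4,
                                                kdim=text_dim, vdim=text_dim,
                                                batch_first=True)
        self.out_conv = nn.Conv2d(hidden, latent_ch, 3, padding=1)

    def forward(self, noisy_latent, spatial_cond, text_tokens):
        x = self.in_conv(torch.cat([noisy_latent, spatial_cond], dim=1))
        b, c, h, w = x.shape
        q = x.flatten(2).transpose(1, 2)             # (B, H*W, hidden)
        attn_out, _ = self.cross_attn(q, text_tokens, text_tokens)
        x = (q + attn_out).transpose(1, 2).reshape(b, c, h, w)
        return self.out_conv(x)                      # predicted noise

# Toy usage with random tensors standing in for real encoder outputs.
model = ToyMultimodalDenoiser()
latent = torch.randn(1, 4, 32, 32)           # noisy VAE latent
pose_and_sketch = torch.randn(1, 4, 32, 32)  # stacked pose/sketch feature maps
text = torch.randn(1, 77, 768)               # e.g. CLIP-like text embeddings
eps_hat = model(latent, pose_and_sketch, text)
print(eps_hat.shape)                         # torch.Size([1, 4, 32, 32])
```

In practice, one denoising step of this kind would be applied repeatedly inside a standard diffusion sampling loop, with the spatial and textual conditions held fixed across steps.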