Fashion-image editing is a challenging computer vision task in which the goal is to incorporate selected apparel into a given input image. Most existing techniques, known as virtual try-on methods, address this task by first selecting an example image of the desired apparel and then transferring the clothing onto the target person. In contrast, in this paper we consider editing fashion images with text descriptions. Such an approach has several advantages over example-based virtual try-on techniques, for example: (i) it does not require an image of the target fashion item, and (ii) it allows a wide variety of visual concepts to be expressed through natural language. Existing image-editing methods that work with language inputs are heavily constrained by their requirement for training sets with rich attribute annotations, or they can handle only simple text descriptions. We address these constraints by proposing a novel text-conditioned editing model, called FICE (Fashion Image CLIP Editing), that can handle a wide variety of text descriptions to guide the editing procedure. Specifically, FICE augments the common GAN inversion process with semantic, pose-related, and image-level constraints when generating images. We leverage the CLIP model to enforce the semantics because of its strong image-text association capabilities. Furthermore, we propose a latent-code regularization technique that provides a means to better control the fidelity of the synthesized images. We validate FICE through rigorous experiments on a combination of VITON images and Fashion-Gen text descriptions, and in comparison with several state-of-the-art text-conditioned image-editing approaches. Experimental results demonstrate that FICE generates highly realistic fashion images and achieves stronger editing performance than competing approaches.
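To make the idea of constrained GAN inversion more concrete, the sketch below optimizes a latent code against a weighted sum of four terms: a semantic term, a pose term, an image-level term, and a latent-code regularizer. It is a toy illustration only and not the FICE implementation. The linear `generator`, the use of the first two "pixels" as a stand-in for a CLIP image embedding, the choice of which pixels play the role of pose and preserved image regions, the loss weights, and the finite-difference gradient descent are all assumptions made for this example.

```python
import math

def generator(w):
    # Hypothetical stand-in for a pretrained GAN generator:
    # a fixed linear map from a 2-D latent code to a 4-"pixel" image.
    return [w[0] + w[1], w[0] - w[1], 0.5 * w[0], 0.5 * w[1]]

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb + 1e-12)

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def total_loss(w, text_emb, source_img, w_mean, weights):
    img = generator(w)
    # Semantic term: toy "image embedding" (pixels 0-1) vs. text embedding.
    # In a real system this role would be played by CLIP (assumption).
    semantic = cosine_distance(img[:2], text_emb)
    # Pose term: keep a toy pose feature (pixel 2) close to the source image.
    pose = sq_dist(img[2:3], source_img[2:3])
    # Image-level term: keep a toy preserved region (pixel 3) unchanged.
    image = sq_dist(img[3:], source_img[3:])
    # Latent-code regularizer: stay close to a reference latent code.
    reg = sq_dist(w, w_mean)
    return (weights["semantic"] * semantic + weights["pose"] * pose
            + weights["image"] * image + weights["reg"] * reg)

def numerical_grad(f, w, eps=1e-5):
    # Central finite differences; a real pipeline would use autodiff.
    grad = []
    for i in range(len(w)):
        hi = list(w)
        lo = list(w)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad

def invert(text_emb, source_img, w_init, w_mean, weights, steps=200, lr=0.05):
    # Plain gradient descent on the latent code; returns the edited code
    # and the loss value recorded before and after every step.
    w = list(w_init)
    objective = lambda v: total_loss(v, text_emb, source_img, w_mean, weights)
    history = [objective(w)]
    for _ in range(steps):
        g = numerical_grad(objective, w)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
        history.append(objective(w))
    return w, history

# Example run with made-up values.
w_init = [1.0, 0.0]
source_img = generator(w_init)
text_emb = [0.0, 1.0]
w_mean = [0.0, 0.0]
weights = {"semantic": 1.0, "pose": 0.1, "image": 0.1, "reg": 0.01}
w_edit, history = invert(text_emb, source_img, w_init, w_mean, weights)
print("initial loss:", history[0], "final loss:", history[-1])
```

In this toy setting the semantic term pulls the generated image toward the text embedding, while the pose, image-level, and regularization terms pull the latent code back toward the source image and the reference code. Changing the entries of `weights` shifts that trade-off, which is the kind of control over fidelity that a latent-code regularizer is meant to provide.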