Contrastive Language-Image Pre-Training (CLIP) has refreshed the state of the art for a broad range of vision-language cross-modal tasks. In particular, it has opened up an intriguing line of research on text-guided image style transfer, dispensing with the need for the style reference images required by traditional style transfer methods. However, directly using CLIP to guide style transfer leads to undesirable artifacts (mainly written words and unrelated visual entities) spread over the image, partly due to the entanglement of visual and written concepts inherent in CLIP. Inspired by the use of spectral analysis to filter linguistic information at different levels of granularity, we analyse the patch embeddings from the last layer of the CLIP vision encoder from a spectral perspective and find that the presence of undesirable artifacts is highly correlated with certain frequency components. We propose SpectralCLIP, which implements a spectral filtering layer on top of the CLIP vision encoder to alleviate the artifact issue. Experimental results show that SpectralCLIP effectively prevents the generation of artifacts, in both quantitative and qualitative terms, without impairing stylisation quality. We further apply SpectralCLIP to text-conditioned image generation and show that it prevents written words from appearing in the generated images. Code is available at https://github.com/zipengxuc/SpectralCLIP.
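To make the core idea concrete, below is a minimal sketch of a spectral filtering layer applied to last-layer CLIP patch embeddings. It assumes PyTorch embeddings of shape (batch, num_patches, dim); the function name `spectral_filter`, the `keep_mask` parameter, and the choice of a 1D real FFT along the patch-sequence axis are illustrative assumptions for exposition, not the paper's exact implementation (see the repository for that).

```python
import torch
import torch.fft


def spectral_filter(patch_embeds: torch.Tensor, keep_mask: torch.Tensor) -> torch.Tensor:
    """Suppress selected frequency components of CLIP patch embeddings.

    Illustrative sketch, not the official SpectralCLIP layer.

    patch_embeds: (batch, num_patches, dim) output of the last layer
        of the CLIP vision encoder.
    keep_mask:    (num_patches // 2 + 1,) mask over rFFT frequency bins;
        zero entries remove the bands correlated with artifacts.
    """
    # 1D real FFT over the patch sequence, independently per channel.
    spectrum = torch.fft.rfft(patch_embeds, dim=1)
    # Zero out the frequency components associated with artifacts.
    spectrum = spectrum * keep_mask.view(1, -1, 1)
    # Inverse FFT back to the token domain, preserving sequence length.
    return torch.fft.irfft(spectrum, n=patch_embeds.size(1), dim=1)


# Hypothetical usage: filter embeddings before computing a CLIP-guided
# style loss, e.g. with a mask that keeps only low-frequency bins.
embeds = torch.randn(2, 196, 768)            # e.g. ViT-B/16 patch tokens
keep_mask = (torch.arange(196 // 2 + 1) < 20).float()
filtered = spectral_filter(embeds, keep_mask)
```

The filtered embeddings would then replace the raw patch embeddings wherever the stylisation objective computes text-image similarity, so that frequency bands linked to written-word artifacts no longer drive the optimisation.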