Text-based style transfer is an emerging research topic that uses text information instead of a style image to guide the transfer process, significantly extending the application scenarios of style transfer. However, previous methods require either extra optimization time or paired text-image data, which limits their effectiveness. In this work, we present a data-efficient text-based style transfer method that requires no optimization at the inference stage. Specifically, we map the text input into the style space of a pre-trained VGG network to realize a more effective style swap. We also leverage CLIP's multi-modal embedding space to learn the text-to-style mapping from an image dataset alone. Our method can transfer arbitrary new styles specified by text input in real time and synthesize high-quality artistic images.
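To make the described pipeline concrete, below is a minimal sketch, not the authors' code, of the inference path implied by the abstract: a frozen CLIP text encoder provides the multi-modal embedding, a small mapper network (hypothetical here, its architecture and the AdaIN-style modulation are assumptions for illustration) converts that embedding into style statistics in the feature space of a pre-trained VGG, and those statistics re-style the content image's VGG features without any test-time optimization.

```python
# Sketch of text-driven stylization via CLIP + VGG feature modulation.
# The TextToStyle mapper and the AdaIN formulation are illustrative assumptions,
# not the paper's exact style-swap module.
import torch
import torch.nn as nn
import clip                        # pip install git+https://github.com/openai/CLIP.git
from torchvision.models import vgg19

device = "cuda" if torch.cuda.is_available() else "cpu"

# Frozen encoders: CLIP supplies the multi-modal embedding space,
# VGG supplies the feature space in which style is represented.
clip_model, _ = clip.load("ViT-B/32", device=device)
vgg = vgg19(weights="IMAGENET1K_V1").features[:21].to(device).eval()  # up to relu4_1

class TextToStyle(nn.Module):
    """Hypothetical mapper from a CLIP text embedding (512-d) to per-channel
    mean/std of a VGG feature layer (512 channels at relu4_1)."""
    def __init__(self, clip_dim=512, vgg_channels=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(clip_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 2 * vgg_channels),  # predicts (mean, std)
        )

    def forward(self, e):
        mean, std = self.net(e).chunk(2, dim=-1)
        return mean, std.abs() + 1e-6

def adain(content_feat, style_mean, style_std):
    """Re-normalize VGG content features with predicted style statistics."""
    c_mean = content_feat.mean(dim=(2, 3), keepdim=True)
    c_std = content_feat.std(dim=(2, 3), keepdim=True) + 1e-6
    normalized = (content_feat - c_mean) / c_std
    return normalized * style_std[..., None, None] + style_mean[..., None, None]

# Inference (no optimization loop): encode the text prompt once, predict style
# statistics, and modulate the content image's VGG features.
mapper = TextToStyle().to(device)   # would be trained beforehand with images + CLIP
content = torch.randn(1, 3, 256, 256, device=device)  # placeholder content image
tokens = clip.tokenize(["oil painting in the style of Van Gogh"]).to(device)
with torch.no_grad():
    text_emb = clip_model.encode_text(tokens).float()
    style_mean, style_std = mapper(text_emb)
    stylized_feat = adain(vgg(content), style_mean, style_std)
# A trained feed-forward decoder (not shown) would map stylized_feat back to an image.
```

Because the text encoder and mapper run in a single forward pass, new styles described by unseen prompts can be applied in real time, which is the property the abstract emphasizes.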