Multi-modal domain translation typically refers to synthesizing a novel image that inherits certain localized attributes from a 'content' image (e.g. layout, semantics, or geometry) and inherits everything else (e.g. texture, lighting, sometimes even semantics) from a 'style' image. The dominant approach to this task attempts to learn disentangled 'content' and 'style' representations from scratch. However, this is not only challenging but ill-posed, as what users wish to preserve during translation varies depending on their goals. Motivated by this inherent ambiguity, we define 'content' based on conditioning information extracted by off-the-shelf pre-trained models. We then train our style extractor and image decoder with an easy-to-optimize set of reconstruction objectives. The wide variety of high-quality pre-trained models available and the simple training procedure make our approach straightforward to apply across numerous domains and definitions of 'content'. Additionally, it offers intuitive control over which aspects of 'content' are preserved across domains. We evaluate our method on traditional, well-aligned datasets such as CelebA-HQ, and propose two novel datasets for evaluation on more complex scenes: ClassicTV and FFHQ-Wild. Our approach, Sensorium, enables higher-quality domain translation for more complex scenes.
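The training recipe described above can be summarized in a few lines. Below is a minimal, hypothetical PyTorch sketch of that setup: a frozen, off-the-shelf network supplies the 'content' representation, while a style encoder and image decoder are trained jointly with a simple reconstruction loss. All module names (StyleEncoder, Decoder, content_net), architectures, and dimensions are illustrative assumptions, not the paper's actual implementation; a frozen randomly initialized convolution stands in here for a real pre-trained model (e.g. a segmentation network or face parser).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleEncoder(nn.Module):
    """Learned: maps an image to a global style vector."""
    def __init__(self, style_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, style_dim)

    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))

class Decoder(nn.Module):
    """Learned: reconstructs an image from content features plus a style vector."""
    def __init__(self, content_ch=19, style_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(content_ch + style_dim, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def forward(self, content, style):
        # Broadcast the style vector across the content map's spatial grid.
        s = style[:, :, None, None].expand(-1, -1, *content.shape[2:])
        return self.net(torch.cat([content, s], dim=1))

# Stand-in for an off-the-shelf pre-trained model; it is frozen, so it only
# *defines* 'content' and never receives gradients.
content_net = nn.Conv2d(3, 19, 3, padding=1)
for p in content_net.parameters():
    p.requires_grad_(False)

style_enc, dec = StyleEncoder(), Decoder()
opt = torch.optim.Adam(list(style_enc.parameters()) + list(dec.parameters()), lr=2e-4)

# One reconstruction step: only the style encoder and decoder are optimized.
img = torch.rand(4, 3, 64, 64)     # a batch of training images
content = content_net(img)         # fixed 'content' representation
style = style_enc(img)             # learned 'style' representation
recon = dec(content, style)        # decode back to image space
loss = F.l1_loss(recon, img)       # easy-to-optimize reconstruction objective
opt.zero_grad(); loss.backward(); opt.step()

# At test time, translation pairs content from image A with style from image B:
# dec(content_net(img_a), style_enc(img_b))
```

Under these assumptions, swapping in a different frozen extractor changes which aspects of 'content' are preserved, which is the source of the intuitive control claimed above.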