We present UniTune, a simple and novel method for general text-driven image editing. UniTune gets as input an arbitrary image and a textual edit description, and carries out the edit while maintaining high semantic and visual fidelity to the input image. UniTune uses text, an intuitive interface for art-direction, and does not require additional inputs, like masks or sketches. At the core of our method is the observation that with the right choice of parameters, we can fine-tune a large text-to-image diffusion model on a single image, encouraging the model to maintain fidelity to the input image while still allowing expressive manipulations. We used Imagen as our text-to-image model, but we expect UniTune to work with other large-scale models as well. We test our method in a range of different use cases, and demonstrate its wide applicability.