We tackle the task of stylizing video objects in an intuitive and semantic manner following a user-specified text prompt. This is a challenging task, as the resulting video must satisfy multiple properties: (1) it has to be temporally consistent and avoid jittering or similar artifacts; (2) the resulting stylization must preserve both the global semantics of the object and its fine-grained details; and (3) it must adhere to the user-specified text prompt. To this end, our method stylizes an object in a video according to two target texts: the first describes the global semantics of the object and the second describes its local semantics. To modify the style of an object, we harness the representational power of CLIP to compute similarity scores between (1) the local target text and a set of local stylized views, and (2) the global target text and a set of global stylized views. We use a pretrained atlas decomposition network to propagate the edits in a temporally consistent manner. We demonstrate that our method can generate consistent style changes over time for a variety of objects and videos that adhere to the specification of the target texts. We also show how varying the specificity of the target texts and augmenting them with a set of prefixes yields stylizations with different levels of detail. Full results are given on our project webpage: https://sloeschcke.github.io/Text-Driven-Stylization-of-Video-Objects/
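As a rough illustration of the CLIP-based guidance described above, the sketch below scores a batch of stylized views against a target text via cosine similarity in CLIP's joint embedding space, using the public openai/CLIP package. The prompts, tensor shapes, and the way the local and global scores are combined are illustrative assumptions, not the authors' exact formulation.

```python
# Minimal sketch of CLIP similarity scoring, assuming the public
# openai/CLIP package (https://github.com/openai/CLIP). Names, shapes,
# and the loss combination below are illustrative assumptions, not the
# authors' released code.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_similarity(views: torch.Tensor, prompt: str) -> torch.Tensor:
    """Mean cosine similarity between stylized views and a target text.

    `views` is an (N, 3, 224, 224) batch of CLIP-preprocessed renders,
    e.g. local crops or full global views of the stylized object.
    """
    tokens = clip.tokenize([prompt]).to(device)
    with torch.no_grad():  # the text embedding stays fixed during optimization
        text_emb = model.encode_text(tokens)
    image_emb = model.encode_image(views.to(device))  # gradients flow back to the stylization
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return (image_emb @ text_emb.T).mean()

# Hypothetical usage: drive the stylization by maximizing both scores.
# `local_views` / `global_views` would be rendered from the edited atlas.
# loss = -(clip_similarity(local_views, "rusty metal")      # local target text
#          + clip_similarity(global_views, "a rusty car"))  # global target text
```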