We tackle the task of stylizing video objects in an intuitive and semantic manner following a user-specified text prompt. This is a challenging task, as the resulting video must satisfy multiple properties: (1) it has to be temporally consistent and avoid jittering or similar artifacts, (2) the resulting stylization must preserve both the global semantics of the object and its fine-grained details, and (3) it must adhere to the user-specified text prompt. To this end, our method stylizes an object in a video according to a global target text prompt that describes the global semantics and a local target text prompt that describes the local semantics. To modify the style of an object, we harness the representational power of CLIP to compute a similarity score between (1) the local target text and a set of local stylized views, and (2) the global target text and a set of stylized global views. We use a pretrained atlas decomposition network to propagate the edits in a temporally consistent manner. We demonstrate that our method can generate temporally consistent style changes for a variety of objects and videos that adhere to the target texts. We also show how varying the specificity of the target texts and augmenting them with a set of prefixes results in stylizations with different levels of detail. Full results are given on our project webpage: https://sloeschcke.github.io/Text-Driven-Stylization-of-Video-Objects/
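To make the CLIP-based scoring concrete, below is a minimal sketch (not the authors' implementation) of how a similarity score between a target text prompt and a set of stylized views can be computed with OpenAI's `clip` package and PyTorch. The `views` tensor, the function name, and the commented-out combined objective are assumptions for illustration only.

```python
# Minimal sketch of CLIP similarity scoring between a text prompt and a
# batch of stylized views. Assumes OpenAI's `clip` package (pip install
# git+https://github.com/openai/CLIP.git) and PyTorch are available.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_similarity(views: torch.Tensor, prompt: str) -> torch.Tensor:
    """Mean cosine similarity between `prompt` and a batch of views.

    `views` is a hypothetical (N, 3, 224, 224) tensor of views that have
    already been resized/normalized with CLIP's `preprocess` transform.
    """
    text = clip.tokenize([prompt]).to(device)
    image_feats = model.encode_image(views.to(device))
    text_feats = model.encode_text(text)
    # Normalize so the dot product equals cosine similarity.
    image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    return (image_feats @ text_feats.T).mean()

# The abstract describes both a global and a local term; one plausible
# (assumed) objective maximizes both similarities jointly, e.g.:
# loss = -(clip_similarity(global_views, global_prompt)
#          + clip_similarity(local_views, local_prompt))
```

In such a setup, the local term would be evaluated on crops around the object and the global term on full stylized frames, so that gradients from both scores guide the stylization toward the two prompts.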