Video personalization aims to generate videos that faithfully reflect a user-provided subject while following a text prompt. However, existing approaches often rely on heavy video-based finetuning or large-scale video datasets, which impose substantial computational cost and are difficult to scale. Furthermore, they still struggle to maintain fine-grained appearance consistency across frames. To address these limitations, we introduce V-Warper, a training-free coarse-to-fine personalization framework for transformer-based video diffusion models. The framework enhances fine-grained identity fidelity without requiring any additional video training. (1) A lightweight coarse appearance adaptation stage leverages only a small set of reference images, which are already required for the task. This step encodes global subject identity through image-only LoRA and subject-embedding adaptation. (2) An inference-time fine appearance injection stage refines visual fidelity by computing semantic correspondences from RoPE-free mid-layer query--key features. These correspondences guide the warping of appearance-rich value representations into semantically aligned regions of the generation process, with masking ensuring spatial reliability. V-Warper significantly improves appearance fidelity while preserving prompt alignment and motion dynamics, and it achieves these gains efficiently without large-scale video finetuning.
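To make the fine appearance injection stage more concrete, the following is a minimal illustrative sketch, not the exact V-Warper implementation. It assumes RoPE-free mid-layer queries of the generated video tokens, keys and values of the reference-image tokens, and hypothetical parameters `sim_threshold` and `blend` for the reliability mask and blending strength.

```python
import torch
import torch.nn.functional as F

def fine_appearance_injection(q_gen, k_ref, v_ref, v_gen,
                              sim_threshold=0.4, blend=0.8):
    """Sketch of correspondence-guided value warping (assumed shapes/names).

    q_gen: (N_gen, d)  RoPE-free mid-layer queries of generated video tokens
    k_ref: (N_ref, d)  RoPE-free mid-layer keys of reference-image tokens
    v_ref: (N_ref, d)  appearance-rich values from the reference image
    v_gen: (N_gen, d)  values of the generated video tokens
    """
    # Semantic correspondence via cosine similarity between generated
    # queries and reference keys.
    sim = F.normalize(q_gen, dim=-1) @ F.normalize(k_ref, dim=-1).T  # (N_gen, N_ref)

    # Best-matching reference token for each generated token.
    score, idx = sim.max(dim=-1)

    # Reliability mask: warp only where the correspondence is confident.
    mask = (score > sim_threshold).float().unsqueeze(-1)  # (N_gen, 1)

    # Warp reference values into semantically aligned positions and blend
    # them with the original values inside the reliable region only.
    v_warped = v_ref[idx]  # (N_gen, d)
    return v_gen * (1.0 - blend * mask) + v_warped * (blend * mask)
```

In this sketch, the masked blend leaves unreliable regions untouched, which is one simple way to realize the "masking ensuring spatial reliability" described above; the actual method may compute correspondences and masks differently.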