The "one-shot" technique represents a distinct and sophisticated aesthetic in filmmaking. However, its practical realization is often hindered by prohibitive costs and complex real-world constraints. Although emerging video generation models offer a virtual alternative, existing approaches typically rely on naive clip concatenation, which frequently fails to maintain visual smoothness and temporal coherence. In this paper, we introduce DreaMontage, a comprehensive framework for arbitrary frame-guided generation, capable of synthesizing seamless, expressive, and long-duration one-shot videos from diverse user-provided inputs. We address this challenge along three primary dimensions. (i) We integrate a lightweight intermediate-conditioning mechanism into the DiT architecture. By employing an Adaptive Tuning strategy that effectively leverages base training data, we unlock robust arbitrary-frame control capabilities. (ii) To enhance visual fidelity and cinematic expressiveness, we curate a high-quality dataset and implement a Visual Expression SFT stage. To address critical issues such as subject-motion rationality and transition smoothness, we apply a Tailored DPO scheme, which significantly improves the success rate and usability of the generated content. (iii) To facilitate the production of extended sequences, we design a Segment-wise Auto-Regressive (SAR) inference strategy that operates in a memory-efficient manner. Extensive experiments demonstrate that our approach achieves visually striking and seamlessly coherent one-shot effects while maintaining computational efficiency, empowering users to transform fragmented visual materials into vivid, cohesive one-shot cinematic experiences.