The "one-shot" technique represents a distinct and sophisticated aesthetic in filmmaking. However, its practical realization is often hindered by prohibitive costs and complex real-world constraints. Although emerging video generation models offer a virtual alternative, existing approaches typically rely on naive clip concatenation, which frequently fails to maintain visual smoothness and temporal coherence. In this paper, we introduce DreaMontage, a comprehensive framework for arbitrary frame-guided generation, capable of synthesizing seamless, expressive, and long-duration one-shot videos from diverse user-provided inputs. We address this challenge along three primary dimensions. (i) We integrate a lightweight intermediate-conditioning mechanism into the DiT architecture. By employing an Adaptive Tuning strategy that effectively leverages base training data, we unlock robust arbitrary-frame control capabilities. (ii) To enhance visual fidelity and cinematic expressiveness, we curate a high-quality dataset and implement a Visual Expression SFT stage. To address critical issues such as subject-motion rationality and transition smoothness, we apply a Tailored DPO scheme, which significantly improves the success rate and usability of the generated content. (iii) To facilitate the production of extended sequences, we design a Segment-wise Auto-Regressive (SAR) inference strategy that operates in a memory-efficient manner. Extensive experiments demonstrate that our approach achieves visually striking and seamlessly coherent one-shot effects while maintaining computational efficiency, empowering users to transform fragmented visual materials into vivid, cohesive one-shot cinematic experiences.