Recent strides in video generation have paved the way for unified audio-visual generation. In this work, we present Seedance 1.5 pro, a foundation model engineered specifically for native, joint audio-video generation. Built on a dual-branch Diffusion Transformer architecture, the model combines a cross-modal joint module with a specialized multi-stage data pipeline, achieving exceptional audio-visual synchronization and superior generation quality. To ensure practical utility, we apply careful post-training optimizations, including Supervised Fine-Tuning (SFT) on high-quality datasets and Reinforcement Learning from Human Feedback (RLHF) with multi-dimensional reward models. Furthermore, we introduce an acceleration framework that boosts inference speed by more than 10×. Seedance 1.5 pro distinguishes itself through precise multilingual and dialect lip-syncing, dynamic cinematic camera control, and enhanced narrative coherence, positioning it as a robust engine for professional-grade content creation. Seedance 1.5 pro is now accessible on Volcano Engine at https://console.volcengine.com/ark/region:ark+cn-beijing/experience/vision?type=GenVideo.