Video generation is a critical pathway toward world models, with efficient long-video inference as a key capability. Toward this end, we introduce LongCat-Video, a foundational video generation model with 13.6B parameters that delivers strong performance across multiple video generation tasks. It particularly excels in efficient, high-quality long video generation, representing our first step toward world models. Key features include:

- Unified architecture for multiple tasks: Built on the Diffusion Transformer (DiT) framework, LongCat-Video supports Text-to-Video, Image-to-Video, and Video-Continuation tasks with a single model.
- Long video generation: Pretraining on Video-Continuation tasks enables LongCat-Video to maintain high quality and temporal coherence when generating minutes-long videos.
- Efficient inference: LongCat-Video generates 720p, 30fps videos within minutes by employing a coarse-to-fine generation strategy along both the temporal and spatial axes. Block Sparse Attention further improves efficiency, particularly at high resolutions.
- Strong performance with multi-reward RLHF: Multi-reward RLHF training enables LongCat-Video to achieve performance on par with the latest closed-source and leading open-source models.

Code and model weights are publicly available to accelerate progress in the field.
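To illustrate the general idea behind Block Sparse Attention, here is a minimal NumPy sketch in which each query block attends only to its top-k most relevant key/value blocks, with relevance scored by mean-pooled block summaries. This is a common heuristic for block selection, not the paper's exact rule; the function name, block size, and selection criterion are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def block_sparse_attention(q, k, v, block=4, keep=2):
    """Sketch of block-sparse attention (assumed mechanism, not the
    paper's exact design): each query block attends only to its `keep`
    most relevant key blocks, chosen by mean-pooled block similarity."""
    n, d = q.shape
    nb = n // block  # assumes n is divisible by block
    qb = q.reshape(nb, block, d)
    kb = k.reshape(nb, block, d)
    vb = v.reshape(nb, block, d)
    # Coarse relevance between block summaries (mean pooling).
    scores = qb.mean(axis=1) @ kb.mean(axis=1).T       # (nb, nb)
    topk = np.argsort(scores, axis=1)[:, -keep:]       # kept key-block indices
    out = np.empty_like(qb)
    for i in range(nb):
        ks = kb[topk[i]].reshape(-1, d)                # gathered keys
        vs = vb[topk[i]].reshape(-1, d)                # gathered values
        attn = softmax(qb[i] @ ks.T / np.sqrt(d))      # dense attn over kept blocks
        out[i] = attn @ vs
    return out.reshape(n, d)

rng = np.random.default_rng(0)
q = rng.standard_normal((16, 8))
k = rng.standard_normal((16, 8))
v = rng.standard_normal((16, 8))
y = block_sparse_attention(q, k, v)
print(y.shape)  # (16, 8)
```

Because each query block only computes attention over `keep` key blocks rather than all of them, the attention cost drops from quadratic in sequence length to roughly linear for fixed `keep`, which is why such schemes help most at high resolutions where token counts are large.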