Traditional animation production involves complex pipelines and significant manual labor costs. While recent video generation models such as Sora, Kling, and CogVideoX achieve impressive results on natural video synthesis, they exhibit notable limitations when applied to animation. Recent efforts such as AniSora demonstrate promising performance by fine-tuning image-to-video models for animation styles, yet analogous exploration in the text-to-video setting remains limited. In this work, we present PTTA, a pure text-to-animation framework for high-quality animation creation. We first construct a small-scale but high-quality paired dataset of animation videos and textual descriptions. Building on the pretrained text-to-video model HunyuanVideo, we fine-tune it for animation-style generation. Extensive visual evaluations across multiple dimensions show that the proposed approach consistently outperforms comparable baselines in animation video synthesis.