Video diffusion models have achieved impressive results in natural scene generation, yet they struggle to generalize to scientific phenomena such as fluid simulations and meteorological processes, whose underlying dynamics are governed by scientific laws. These tasks pose unique challenges, including severe domain gaps, limited training data, and the lack of descriptive language annotations. To address this problem, we extract latent knowledge of scientific phenomena and propose a novel framework that teaches video diffusion models to generate scientific phenomena from a single initial frame. Specifically, static knowledge is extracted via pre-trained masked autoencoders, while dynamic knowledge is derived from pre-trained optical flow prediction. Subsequently, leveraging the alignment between the embedding spaces of the CLIP vision and language encoders, the visual embeddings of scientific phenomena, guided by this latent knowledge, are projected into pseudo-language prompt embeddings in both the spatial and frequency domains. By incorporating these prompts and fine-tuning the video diffusion model, we enable the generation of videos that better adhere to scientific laws. Extensive experiments on both computational fluid dynamics simulations and real-world typhoon observations demonstrate the effectiveness of our approach, which achieves superior fidelity and consistency across diverse scientific scenarios.
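To make the pipeline above concrete, the sketch below illustrates how pooled CLIP visual features, static (MAE) features, and dynamic (optical-flow) features might be fused and projected into pseudo-language prompt embeddings with a spatial-domain and a frequency-domain branch. This is a minimal, illustrative PyTorch module only: the fusion MLP, the gated Fourier refinement, the 77-token prompt length, and all dimensions are assumptions for exposition, not the actual architecture described in the paper.

```python
# Illustrative sketch of pseudo-language prompt generation; all design choices
# (fusion MLP, gated FFT branch, dimensions, token count) are assumptions.
import torch
import torch.nn as nn


class PseudoPromptProjector(nn.Module):
    """Project CLIP visual embeddings, conditioned on static (MAE) and dynamic
    (optical-flow) knowledge features, into pseudo-language prompt embeddings
    via a spatial-domain branch plus a frequency-domain refinement."""

    def __init__(self, vis_dim=768, know_dim=768, prompt_dim=768, n_tokens=77):
        super().__init__()
        self.n_tokens = n_tokens
        # Fuse the visual embedding with static/dynamic knowledge features.
        self.fuse = nn.Sequential(
            nn.Linear(vis_dim + 2 * know_dim, prompt_dim),
            nn.GELU(),
            nn.Linear(prompt_dim, prompt_dim),
        )
        # Spatial-domain branch: expand fused features into a token sequence.
        self.spatial_proj = nn.Linear(prompt_dim, n_tokens * prompt_dim)
        # Frequency-domain branch: channel-wise gate applied in Fourier space.
        self.freq_gate = nn.Linear(prompt_dim, prompt_dim)

    def forward(self, clip_vis, static_feat, flow_feat):
        # clip_vis, static_feat, flow_feat: (B, dim) pooled feature vectors.
        fused = self.fuse(torch.cat([clip_vis, static_feat, flow_feat], dim=-1))
        tokens = self.spatial_proj(fused).view(-1, self.n_tokens, fused.size(-1))
        # FFT along the token axis, gated scaling, then inverse FFT.
        spec = torch.fft.rfft(tokens, dim=1)
        gate = torch.sigmoid(self.freq_gate(fused)).unsqueeze(1)
        tokens_freq = torch.fft.irfft(spec * gate, n=self.n_tokens, dim=1)
        # Pseudo-language prompt embeddings for the diffusion model's text cross-attention.
        return tokens + tokens_freq


if __name__ == "__main__":
    proj = PseudoPromptProjector()
    prompts = proj(torch.randn(2, 768), torch.randn(2, 768), torch.randn(2, 768))
    print(prompts.shape)  # torch.Size([2, 77, 768])
```

In this sketch the resulting prompt tensor stands in for the text-encoder output that conditions the video diffusion model during fine-tuning; the 77-token length simply mirrors the CLIP text encoder's sequence length and is an assumption.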