Recent breakthroughs in Vision-and-Language (V&L) joint research have achieved remarkable results in various text-driven tasks. High-quality Text-to-Video (T2V), a task long considered out of reach, has recently been shown to be feasible with reasonably good results. However, the resulting videos often contain undesired artifacts, largely because such systems are purely data-driven and agnostic to physical laws. To tackle this issue and push T2V further toward high-level physical realism, we present an autonomous data generation technique and an accompanying dataset, which are intended to narrow the gap by supplying a large number of multi-modal, 3D Text-to-Video/Simulation (T2V/S) data. The dataset provides high-resolution 3D physical simulations of both solids and fluids, together with textual descriptions of the simulated physical phenomena. We take advantage of two state-of-the-art physical simulation methods, (i) Incremental Potential Contact (IPC) and (ii) the Material Point Method (MPM), to simulate diverse scenarios, including elastic deformation, material fracture, collision, and turbulence. In addition, we supply high-quality, multi-view rendered videos for the benefit of the T2V, Neural Radiance Fields (NeRF), and related communities. This work is a first step toward fully automated Text-to-Video/Simulation (T2V/S). Live examples and follow-up work are available at https://sites.google.com/view/tpa-net.