Videos of actions are complex signals containing rich compositional structure in space and time. Current video generation methods lack the ability to condition the generation on multiple coordinated and potentially simultaneous timed actions. To address this challenge, we propose to represent actions in a graph structure called an Action Graph and present the new ``Action Graph To Video'' synthesis task. Our generative model for this task (AG2Vid) disentangles motion and appearance features and, by incorporating a scheduling mechanism for actions, facilitates timely and coordinated video generation. We train and evaluate AG2Vid on the CATER and Something-Something V2 datasets and show that the resulting videos have better visual quality and semantic consistency compared to baselines. Finally, our model demonstrates zero-shot abilities by synthesizing novel compositions of the learned actions. For code and pretrained models, see the project page: https://roeiherz.github.io/AG2Video
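To make the Action Graph representation concrete, the following is a minimal sketch of what such a structure might contain: objects as nodes and timed actions as edges, where overlapping time intervals encode coordinated, simultaneous actions. The class and field names (and the frame-indexed timing) are illustrative assumptions for exposition, not the actual AG2Vid data format; the abstract only states that actions are represented as a graph of coordinated, timed actions.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative sketch of an Action Graph (hypothetical schema, not the AG2Vid API):
# objects are nodes, and each timed action is an edge between a subject node and
# a target node, annotated with start/end frames.

@dataclass
class ObjectNode:
    obj_id: int
    category: str            # e.g. "metal sphere"

@dataclass
class ActionEdge:
    action: str              # e.g. "slide", "rotate", "contain"
    subject: int             # obj_id of the acting object
    target: int              # obj_id of the object acted upon (may equal subject)
    start: int               # start frame of the action
    end: int                 # end frame; overlapping intervals mean simultaneous actions

@dataclass
class ActionGraph:
    objects: List[ObjectNode] = field(default_factory=list)
    actions: List[ActionEdge] = field(default_factory=list)

# Two coordinated, partially overlapping actions on CATER-style objects.
ag = ActionGraph(
    objects=[ObjectNode(0, "metal sphere"), ObjectNode(1, "rubber cone")],
    actions=[
        ActionEdge("slide", subject=0, target=0, start=0, end=16),
        ActionEdge("contain", subject=1, target=0, start=10, end=24),
    ],
)
```

In such a representation, a scheduling mechanism would decide, at each generation step, which action edges are active given their time intervals, which is what allows the generated video to stay temporally coordinated.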