The existing state-of-the-art method for audio-visual conditioned video prediction uses the latent codes of the audio-visual frames from a multimodal stochastic network and a frame encoder to predict the next visual frame. However, directly inferring per-pixel intensity for the next visual frame from the latent codes is extremely challenging because of the high dimensionality of the image space. To this end, we propose to decouple audio-visual conditioned video prediction into motion and appearance modeling. The first part is a multimodal motion estimation module that learns motion information as optical flow from the given audio-visual clip. The second part is a context-aware refinement module that uses the predicted optical flow to warp the current visual frame into the next visual frame and refines it based on the given audio-visual context. Experimental results show that our method achieves competitive results on existing benchmarks.
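To make the decoupling concrete, the sketch below illustrates the flow-warping step at the heart of the second module: backward-warping the current frame by a predicted optical flow field. This is a minimal PyTorch illustration, assuming standard bilinear warping via `grid_sample`; the function name `warp_with_flow` and all tensor shapes are illustrative conventions, not the authors' implementation, and the motion estimation and refinement networks themselves are omitted.

```python
# Minimal sketch of flow-based frame warping (assumed convention:
# frame is (B, C, H, W), flow is (B, 2, H, W) in pixel units, where
# flow[:, 0] is horizontal and flow[:, 1] is vertical displacement).
import torch
import torch.nn.functional as F

def warp_with_flow(frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp `frame` by the predicted optical flow."""
    b, _, h, w = frame.shape
    # Base sampling grid of absolute pixel coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(h, device=frame.device, dtype=frame.dtype),
        torch.arange(w, device=frame.device, dtype=frame.dtype),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=0).unsqueeze(0) + flow  # (B, 2, H, W)
    # Normalize coordinates to [-1, 1], as required by grid_sample.
    grid_x = 2.0 * grid[:, 0] / max(w - 1, 1) - 1.0
    grid_y = 2.0 * grid[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)  # (B, H, W, 2)
    return F.grid_sample(frame, grid, align_corners=True)

# Sanity check: zero flow should reproduce the input frame, after which a
# refinement network (not shown) would correct occlusions and appearance.
frame = torch.randn(1, 3, 64, 64)
flow = torch.zeros(1, 2, 64, 64)
assert torch.allclose(warp_with_flow(frame, flow), frame, atol=1e-5)
```

The design rationale mirrors the abstract: the flow field carries the motion, so the network only needs to predict a low-dimensional displacement per pixel rather than regress raw intensities, and the refinement module then fixes warping artifacts using the audio-visual context.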