In this work, we explore a new problem: frame interpolation for speech videos. Such content now constitutes the major form of online communication. We approach the problem by applying several deep learning video generation algorithms to generate the missing frames. We also provide examples in which computer vision models, despite performing well on conventional non-linguistic metrics, fail to produce faithful interpolations of speech. Motivated by this, we propose a new set of linguistically informed metrics targeted specifically at speech video interpolation. We also release several datasets for testing the speech understanding of computer vision video generation models.