We propose a novel deep multi-modality neural network for restoring talking-head videos compressed at very low bit rates. Such video content is common in social media, teleconferencing, distance education, telemedicine, and similar applications, and often must be transmitted over limited bandwidth. The proposed CNN method exploits the correlations among three modalities (video, audio, and the emotional state of the speaker) to remove the video compression artifacts caused by spatial downsampling and quantization. A deep learning approach is well suited to this restoration task because the complex non-linear cross-modality correlations are very difficult to model analytically and explicitly. The new method is a video post-processor that can significantly boost the perceptual quality of aggressively compressed talking-head videos while remaining fully compatible with all existing video compression standards.
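
To make the two artifact sources named above concrete, the following is a minimal sketch of a degradation pipeline: block-average spatial downsampling followed by uniform quantization, then nearest-neighbour upsampling back to the original grid. It is an illustrative assumption, not the codec or the network used in this work. Real codecs quantize transform coefficients rather than pixel values, and the function names (`downsample`, `quantize`, `upsample_nearest`, `degrade`, `mse`), the factor of 2, and the step size of 32 are all hypothetical choices made for the example.

```python
def downsample(frame, factor):
    """Spatial downsampling: average each non-overlapping factor x factor block."""
    h, w = len(frame), len(frame[0])
    out = []
    for y in range(0, h - h % factor, factor):
        row = []
        for x in range(0, w - w % factor, factor):
            block = [frame[y + dy][x + dx]
                     for dy in range(factor) for dx in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def quantize(frame, step):
    """Uniform scalar quantization: snap each value to the nearest multiple of step."""
    return [[round(v / step) * step for v in row] for row in frame]


def upsample_nearest(frame, factor):
    """Nearest-neighbour upsampling: repeat each sample factor times per axis."""
    out = []
    for row in frame:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def degrade(frame, factor=2, step=32):
    """Toy stand-in for aggressive compression (assumed, simplified model)."""
    return upsample_nearest(quantize(downsample(frame, factor), step), factor)


def mse(a, b):
    """Mean squared error between two equally sized frames."""
    n = len(a) * len(a[0])
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n


# Toy 4x4 luminance patch (values chosen arbitrarily for illustration).
frame = [
    [10, 20, 200, 210],
    [30, 40, 220, 230],
    [50, 60, 100, 110],
    [70, 80, 120, 130],
]
degraded = degrade(frame)
```

In this toy setting every 2x2 block collapses to a single quantized value, so fine detail is lost and block edges appear. A restoration post-processor of the kind described above would take such a degraded frame as input and attempt to recover the missing detail, here with the additional help of audio and emotion cues that this sketch does not model.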