Video represents the majority of internet traffic today, driving a continuous technological arms race between generating higher-quality content, transmitting larger files, and supporting network infrastructure. The recent surge in video conferencing fueled by the COVID-19 pandemic adds to this demand. Since videos take up substantial bandwidth (~100 Kbps to a few Mbps), improved video compression can substantially improve network performance for live and pre-recorded content, providing broader access to multimedia content worldwide. In this work, we present a novel video compression pipeline, called Txt2Vid, which substantially reduces data transmission rates by compressing webcam videos ("talking-head videos") to a text transcript. The text is transmitted and decoded into a realistic reconstruction of the original video using recent advances in deep-learning-based voice cloning and lip syncing models. Our generative pipeline achieves a two to three orders of magnitude reduction in bitrate compared to standard audio-video codecs (encoders-decoders), while maintaining equivalent Quality-of-Experience based on a subjective evaluation by users (n=242) in an online study. The Txt2Vid framework opens up the potential for novel applications, such as enabling audio-video communication over poor internet connections or in remote terrain with limited bandwidth. The code for this work is available at https://github.com/tpulkit/txt2vid.git.
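To give a rough sense of where a reduction of this size can come from, the following back-of-envelope sketch compares the bitrate of an uncompressed transcript against the low end of the video bandwidth range quoted above. The speaking rate (150 words per minute) and average word size (6 bytes, including the space) are illustrative assumptions, not figures from this work, and the function names are hypothetical. The sketch ignores any side information the decoder needs and does not model the operating points of real codecs.

```python
import math


def text_bitrate_bps(words_per_minute=150.0, bytes_per_word=6.0):
    """Bitrate of an uncompressed transcript in bits per second.

    Both defaults are assumed, illustrative values (not from the paper).
    """
    return words_per_minute * bytes_per_word * 8 / 60.0


def orders_of_magnitude(video_bps, text_bps):
    """Base-10 logarithm of the ratio between video and text bitrates."""
    return math.log10(video_bps / text_bps)


if __name__ == "__main__":
    text_bps = text_bitrate_bps()
    video_bps = 100_000  # low end of the "~100 Kbps to a few Mbps" range
    gap = orders_of_magnitude(video_bps, text_bps)
    print(f"text: {text_bps:.0f} bps, video: {video_bps} bps, gap: {gap:.2f} orders")
```

Under these assumptions the transcript needs on the order of a hundred bits per second, which puts the gap to a 100 Kbps video stream at just under three orders of magnitude. The reduction reported in this work comes from its own comparison against standard codecs, not from this estimate.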