Video represents the majority of internet traffic today, leading to a continuous technological arms race between generating higher-quality content, transmitting larger files, and supporting the network infrastructure. Adding to this is the surge in the use of video conferencing tools fueled by the recent COVID-19 pandemic. Since videos consume substantial bandwidth (~100 Kbps to a few Mbps), improved video compression can have a significant impact on network performance for both live and pre-recorded content, providing broader access to multimedia content worldwide. In this work, we present a novel video compression pipeline, called Txt2Vid, which drastically reduces data transmission rates by compressing webcam videos ("talking-head videos") to a text transcript. The text is transmitted and decoded into a realistic reconstruction of the original video using recent advances in deep-learning-based voice cloning and lip-syncing models. Our generative pipeline achieves a two-to-three-orders-of-magnitude reduction in bitrate compared to standard audio-video codecs (encoders-decoders), while maintaining an equivalent Quality-of-Experience based on a subjective evaluation by users (n=242) in an online study. The code for this work is available at https://github.com/tpulkit/txt2vid.git.
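To make the scale of the claimed savings concrete, the following back-of-the-envelope sketch compares the bitrate of streaming a plain-text transcript in real time against typical webcam-call bitrates. The speaking rate, characters per word, and call bitrates are assumed illustrative figures, not measurements from this work.

```python
# Rough comparison of transcript vs. audio-video bitrates.
# All constants below are illustrative assumptions, not values from the Txt2Vid paper.

SPEAKING_RATE_WPM = 150      # typical conversational speaking rate (words per minute)
AVG_CHARS_PER_WORD = 6       # average word length including a trailing space
BITS_PER_CHAR = 8            # uncompressed ASCII/UTF-8 text

# Bitrate needed to stream the text transcript of live speech.
transcript_bps = SPEAKING_RATE_WPM / 60 * AVG_CHARS_PER_WORD * BITS_PER_CHAR
print(f"transcript: ~{transcript_bps:.0f} bit/s")  # ~120 bit/s

# Assumed bitrates for a compressed webcam call (audio + video); actual values
# depend on codec, resolution, and network conditions.
for label, kbps in [("low-end video call", 100), ("high-quality video call", 1000)]:
    ratio = kbps * 1000 / transcript_bps
    print(f"{label}: {kbps} kbit/s (~{ratio:,.0f}x the transcript bitrate)")
```

Under these rough assumptions the transcript needs on the order of a hundred bits per second, while the video call needs hundreds of kilobits per second or more, which is the multiple-orders-of-magnitude gap the pipeline exploits.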