To unlock video chat for hundreds of millions of people hindered by poor connectivity or unaffordable data costs, we propose to authentically reconstruct faces on the receiver's device from facial landmarks extracted on the sender's side and transmitted over the network. In this context, we discuss and evaluate the advantages and disadvantages of several deep adversarial approaches. In particular, we explore the quality and bandwidth trade-offs of approaches based on static landmarks, dynamic landmarks, or segmentation maps. We design a mobile-compatible architecture based on the first-order animation model of Siarohin et al. In addition, we leverage SPADE blocks to refine results in important areas such as the eyes and lips. We compress the networks down to about 3 MB, allowing the models to run in real time on the CPU of an iPhone 8. This approach enables video calling at a few kbit/s, an order of magnitude lower than currently available alternatives.
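To give a feel for why sending landmarks instead of pixels can bring a call down to a few kbit/s, the following back-of-the-envelope sketch computes the raw bitrate of a landmark stream. Every number in it (landmark count, bits per coordinate, frame rate) is a hypothetical assumption chosen for illustration and is not taken from the paper; the function name `landmark_bitrate_kbps` is likewise invented here. The estimate ignores entropy coding, temporal prediction, and packet overhead, so it should be read as a rough order-of-magnitude figure only.

```python
def landmark_bitrate_kbps(num_landmarks, coords_per_landmark=2,
                          bits_per_coord=8, fps=25.0):
    """Raw bitrate (kbit/s) of an uncompressed landmark stream.

    All parameters are illustrative assumptions, not values from the paper.
    """
    # Bits needed to describe one frame: every landmark has a fixed number
    # of coordinates, each quantized to a fixed number of bits.
    bits_per_frame = num_landmarks * coords_per_landmark * bits_per_coord
    # Multiply by the frame rate and convert bit/s to kbit/s.
    return bits_per_frame * fps / 1000.0


if __name__ == "__main__":
    # Hypothetical sparse setting: 10 keypoints, 8-bit (x, y), 15 frames/s.
    sparse = landmark_bitrate_kbps(10, fps=15)
    # Hypothetical dense setting: 68 landmarks, 8-bit (x, y), 25 frames/s.
    dense = landmark_bitrate_kbps(68, fps=25)
    print(f"sparse: {sparse:.1f} kbit/s, dense: {dense:.1f} kbit/s")
```

Under these assumed settings the sparse stream comes to 2.4 kbit/s and the dense one to 27.2 kbit/s, which illustrates how the number of landmarks and the frame rate dominate the bandwidth budget before any further compression is applied.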