Virtual humans have gained considerable attention in numerous industries, e.g., entertainment and e-commerce. As a core technology, synthesizing photorealistic face frames from target speech and a facial identity has been actively studied with generative adversarial networks. Despite the remarkable results of modern talking-face generation models, they often entail high computational burdens, which limit their efficient deployment. This study aims to develop a lightweight model for speech-driven talking-face synthesis. We build a compact generator from Wav2Lip, a popular talking-face generator, by removing its residual blocks and reducing the channel width. We also present a knowledge distillation scheme that trains the small-capacity generator stably and effectively without adversarial learning. We reduce the number of parameters and MACs by 28$\times$ while retaining the performance of the original model. Moreover, to alleviate the severe performance drop caused by converting the whole generator to INT8 precision, we adopt a selective quantization method that uses FP16 for the quantization-sensitive layers and INT8 for the remaining layers. With this mixed precision, we achieve up to a 19$\times$ speedup on edge GPUs without noticeably compromising the generation quality.
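The distillation-based training can be pictured with a minimal PyTorch sketch: the frozen Wav2Lip teacher supervises the compact student through a pixel-level reconstruction term and a feature-matching term, with no discriminator in the loop. The specific losses, the 1$\times$1 channel adapters, and the weighting below are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAdapter(nn.Module):
    """1x1 conv mapping the narrow student features to the teacher's channel
    width so the two can be compared (an assumed design choice, needed here
    because the student's channel width was reduced)."""
    def __init__(self, student_ch: int, teacher_ch: int):
        super().__init__()
        self.proj = nn.Conv2d(student_ch, teacher_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)

def distillation_loss(student_frame, teacher_frame,
                      adapted_student_feats, teacher_feats,
                      lambda_feat: float = 1.0):
    """Illustrative distillation objective: no adversarial term, only
    reconstruction against the frozen teacher plus feature matching."""
    # Pixel-level term: the student imitates the teacher's generated frame.
    pixel = F.l1_loss(student_frame, teacher_frame)
    # Feature-matching term over paired intermediate activations.
    feat = sum(F.l1_loss(s, t)
               for s, t in zip(adapted_student_feats, teacher_feats))
    return pixel + lambda_feat * feat
```

The adapters would be trained jointly with the student and discarded afterward, so only the compact generator is deployed; dropping the discriminator is what makes the training stable for a generator this small, in line with the abstract's claim.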
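The selective quantization step can likewise be sketched. Actual FP16/INT8 mixed-precision deployment on an edge GPU would go through an inference engine such as TensorRT; the snippet below only illustrates, under stated assumptions, how quantization-sensitive layers might be identified: simulate INT8 quantization of each layer's weights one at a time and keep any layer whose induced output error exceeds a threshold in FP16. The function names, error metric, and threshold are hypothetical.

```python
import torch
import torch.nn as nn

def fake_quant_int8(w: torch.Tensor) -> torch.Tensor:
    """Simulate symmetric per-tensor INT8 quantization of a weight tensor."""
    scale = w.abs().max().clamp(min=1e-8) / 127.0
    return torch.round(w / scale).clamp(-128, 127) * scale

@torch.no_grad()
def find_fp16_layers(model: nn.Module, calib_input: torch.Tensor,
                     threshold: float = 1e-3) -> list[str]:
    """Rank layers by how much INT8 weight quantization perturbs the model
    output; layers above `threshold` stay in FP16, the rest go to INT8.
    Both the metric and the threshold are illustrative assumptions."""
    reference = model(calib_input)
    keep_fp16 = []
    for name, module in model.named_modules():
        if isinstance(module, nn.Conv2d):
            original = module.weight.data.clone()
            module.weight.data = fake_quant_int8(original)   # perturb one layer
            err = (model(calib_input) - reference).abs().mean().item()
            module.weight.data = original                    # restore weights
            if err > threshold:
                keep_fp16.append(name)
    return keep_fp16
```

The resulting layer list would then be handed to the deployment toolchain as per-layer precision constraints, leaving every other layer in INT8.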