FastLTS: Non-Autoregressive End-to-End Unconstrained Lip-to-Speech Synthesis

Unconstrained lip-to-speech synthesis aims to generate the corresponding speech from silent videos of talking faces, with no restriction on head pose or vocabulary. Current works mainly use sequence-to-sequence models to solve this problem, with either an autoregressive architecture or a flow-based non-autoregressive architecture. However, these models suffer from several drawbacks: 1) Instead of generating audio directly, they use a two-stage pipeline that first generates mel-spectrograms and then reconstructs audio from them. This makes deployment cumbersome and degrades speech quality through error propagation; 2) The audio reconstruction algorithm used by these models limits inference speed and audio quality, while neural vocoders cannot be used with these models because their output spectrograms are not accurate enough; 3) The autoregressive model suffers from high inference latency, while the flow-based model has a high memory footprint: neither is efficient in both time and memory usage. To tackle these problems, we propose FastLTS, a non-autoregressive end-to-end model that directly synthesizes high-quality speech audio from unconstrained talking videos with low latency and has a relatively small model size. In addition, in contrast to the widely used 3D-CNN visual frontend for lip movement encoding, we propose, for the first time, a transformer-based visual frontend for this task. Experiments show that our model achieves a $19.76\times$ speedup in audio waveform generation over the current autoregressive model on 3-second input sequences, and obtains superior audio quality.
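The latency contrast between autoregressive and non-autoregressive decoding can be illustrated with a toy sketch. This is not the FastLTS architecture or any model from the paper: the function names, the scalar "frames", and the 75-frame input (3 seconds at an assumed 25 frames per second) are all hypothetical. It only shows why an autoregressive decoder needs one sequential step per output frame, while a non-autoregressive decoder can, in principle, produce all frames in a single parallel pass.

```python
from typing import Callable, List, Tuple


def autoregressive_decode(
    features: List[float], step: Callable[[float, float], float]
) -> Tuple[List[float], int]:
    """Toy autoregressive decoder: each output depends on the previous output,
    so the frames must be produced one after another."""
    outputs: List[float] = []
    prev = 0.0
    sequential_steps = 0
    for feat in features:
        prev = step(feat, prev)  # needs the previous output before continuing
        outputs.append(prev)
        sequential_steps += 1
    return outputs, sequential_steps


def non_autoregressive_decode(
    features: List[float], frame_fn: Callable[[float], float]
) -> Tuple[List[float], int]:
    """Toy non-autoregressive decoder: each output depends only on its input,
    so all frames could be computed in one parallel pass (the list
    comprehension stands in for that parallel computation)."""
    outputs = [frame_fn(feat) for feat in features]
    return outputs, 1


# Hypothetical input: 75 visual feature frames (3 s at an assumed 25 fps).
features = [0.1] * 75
ar_out, ar_steps = autoregressive_decode(features, lambda f, p: f + 0.5 * p)
nar_out, nar_steps = non_autoregressive_decode(features, lambda f: 2.0 * f)
print(ar_steps, nar_steps)
```

The step counts here only count sequential dependencies; actual speedups, such as the figure reported in the abstract, depend on the model and hardware and are not reproduced by this sketch.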

