Unconstrained lip-to-speech synthesis aims to generate the corresponding speech from silent videos of talking faces, with no restriction on head poses or vocabulary. Current works mainly adopt sequence-to-sequence models, either with an autoregressive architecture or a flow-based non-autoregressive architecture. However, these models suffer from several drawbacks: 1) instead of directly generating audio, they use a two-stage pipeline that first predicts mel-spectrograms and then reconstructs audio waveforms from them, which complicates deployment and degrades speech quality through error propagation; 2) the audio reconstruction algorithm used by these models limits both inference speed and audio quality, while neural vocoders are not applicable because the predicted spectrograms are not accurate enough; 3) the autoregressive model suffers from high inference latency, while the flow-based model has a large memory footprint: neither is efficient in both time and memory. To tackle these problems, we propose FastLTS, a non-autoregressive end-to-end model that directly synthesizes high-quality speech waveforms from unconstrained talking-face videos with low latency and a relatively small model size. Moreover, departing from the widely used 3D-CNN visual frontend for lip-movement encoding, we are, to the best of our knowledge, the first to propose a transformer-based visual frontend for this task. Experiments show that our model achieves a $19.76\times$ speedup in audio waveform generation over the current autoregressive model on 3-second input sequences, while obtaining superior audio quality.
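To make the "transformer-based visual frontend" concrete, below is a minimal sketch of what such a frontend could look like: each lip-region frame is embedded by a small 2D-CNN stem, and a temporal Transformer encoder then models lip motion across frames. This is an illustrative assumption, not the paper's exact architecture; all module names, layer sizes, and input resolutions here are hypothetical.

```python
# Hypothetical sketch of a transformer-based visual frontend for
# lip-to-speech: per-frame 2D-CNN embedding + temporal Transformer.
# Not the FastLTS implementation; sizes and names are assumptions.
import torch
import torch.nn as nn

class TransformerVisualFrontend(nn.Module):  # hypothetical name
    def __init__(self, d_model=512, n_heads=8, n_layers=6, max_len=250):
        super().__init__()
        # Per-frame stem: 96x96 grayscale lip crop -> 128-dim vector.
        self.stem = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # -> (B*T, 128, 1, 1)
        )
        self.proj = nn.Linear(128, d_model)
        # Learned positional embeddings over the frame (time) axis.
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, frames):                # frames: (B, T, 1, H, W)
        b, t = frames.shape[:2]
        x = self.stem(frames.flatten(0, 1))   # (B*T, 128, 1, 1)
        x = self.proj(x.flatten(1))           # (B*T, d_model)
        x = x.view(b, t, -1) + self.pos[:, :t]  # add temporal positions
        return self.encoder(x)                # (B, T, d_model)

# Usage: 3 seconds of 25 fps video -> 75 frames of per-frame features,
# which a downstream decoder could map directly to a waveform.
feats = TransformerVisualFrontend()(torch.randn(2, 75, 1, 96, 96))
print(feats.shape)  # torch.Size([2, 75, 512])
```

Compared with a 3D-CNN frontend, whose temporal receptive field is fixed by its kernel sizes, the self-attention layers in a frontend like this can relate any pair of frames in the clip, which is one plausible motivation for the design choice.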