In the articulatory synthesis task, speech is synthesized from input features containing information about the physical behavior of the human vocal tract. This task provides a promising direction for speech synthesis research, as the articulatory space is compact, smooth, and interpretable. Recent work has highlighted the potential of deep learning models for articulatory synthesis. However, it remains unclear whether these models can achieve the efficiency and fidelity of the human speech production system. To help bridge this gap, we propose a time-domain articulatory synthesis methodology and demonstrate its efficacy with both electromagnetic articulography (EMA) and synthetic articulatory feature inputs. Our model is computationally efficient and achieves a transcription word error rate (WER) of 18.5% on the EMA-to-speech task, an improvement of 11.6% over prior work. Through interpolation experiments, we also highlight the generalizability and interpretability of our approach.