Synthesizing speech from articulatory movements has real-world applications for patients with vocal cord disorders, in situations requiring silent speech, and in high-noise environments. In this work, we present EMA2S, an end-to-end multimodal articulatory-to-speech system that directly converts articulatory movements to speech signals. We use a neural-network-based vocoder combined with multimodal joint training, incorporating spectrogram, mel-spectrogram, and deep features. The experimental results confirm that the multimodal approach of EMA2S outperforms the baseline system in both objective and subjective evaluation metrics. Moreover, the results demonstrate that joint mel-spectrogram and deep-feature loss training can effectively improve system performance.
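The joint multimodal training described above combines losses over several feature streams. A minimal sketch of such a combined objective, assuming L1 distances and equal weights (the function name `joint_loss`, the L1 choice, and the weights are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def joint_loss(pred_spec, tgt_spec, pred_mel, tgt_mel, pred_deep, tgt_deep,
               w_spec=1.0, w_mel=1.0, w_deep=1.0):
    """Weighted sum of mean-L1 distances over three feature streams:
    spectrogram, mel-spectrogram, and deep features.
    All weights and the L1 metric are illustrative assumptions."""
    l_spec = np.abs(pred_spec - tgt_spec).mean()   # spectrogram term
    l_mel  = np.abs(pred_mel - tgt_mel).mean()     # mel-spectrogram term
    l_deep = np.abs(pred_deep - tgt_deep).mean()   # deep-feature term
    return w_spec * l_spec + w_mel * l_mel + w_deep * l_deep
```

In training, each term would be backpropagated through the vocoder jointly, so the model is pushed to match targets in all three feature spaces at once.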