We estimate articulatory movements in speech production from two different modalities: acoustics and phonemes. Acoustic-to-articulatory inversion (AAI) is a sequence-to-sequence task. Phoneme-to-articulatory (PTA) motion estimation, on the other hand, faces a key challenge in reliably aligning the text with the articulatory movements. To address this challenge, we explore the use of a transformer architecture, FastSpeech, with explicit duration modelling to learn hard alignments between the phonemes and the articulatory movements. We also train a transformer model on AAI. We use the correlation coefficient (CC) and root mean squared error (rMSE) to assess estimation performance in comparison with existing methods on both tasks. For PTA estimation, we observe 154%, 11.8% and 4.8% relative improvement in CC with subject-dependent, pooled and fine-tuning strategies, respectively. On the AAI task, we obtain 1.5%, 3% and 3.1% relative gain in CC on the same setups compared to the state-of-the-art baseline. We further present the computational benefits of using transformer architectures as representation blocks.
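As an illustration of the two evaluation metrics named above, the following is a minimal sketch of how CC and rMSE could be computed for predicted versus measured articulatory trajectories, along with the relative-improvement arithmetic. The function names, the `(frames, articulators)` array layout, and the averaging across articulators are assumptions made for this example and are not taken from the paper; the numbers in the usage lines are toy values, not reported results.

```python
import numpy as np


def cc_and_rmse(pred, target):
    """Return (mean CC, mean rMSE) over articulatory trajectories.

    pred, target: arrays of shape (frames, articulators).
    Averaging over articulators is an assumption of this sketch.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)

    # Pearson correlation coefficient, computed per trajectory (column).
    p = pred - pred.mean(axis=0)
    t = target - target.mean(axis=0)
    cc = (p * t).sum(axis=0) / np.sqrt((p ** 2).sum(axis=0) * (t ** 2).sum(axis=0))

    # Root mean squared error, computed per trajectory (column).
    rmse = np.sqrt(((pred - target) ** 2).mean(axis=0))

    return cc.mean(), rmse.mean()


def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# Toy usage: a prediction offset by a constant is perfectly correlated
# with the target, but still has a non-zero error.
target = np.array([[0.0, 1.0], [1.0, 3.0], [2.0, 5.0]])
pred = target + 1.0
mean_cc, mean_rmse = cc_and_rmse(pred, target)
```

The toy case shows why the two metrics are reported together: CC captures how well the shape of a trajectory is tracked, while rMSE also penalizes offsets in absolute position.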