Real-time magnetic resonance imaging (rtMRI) of the midsagittal plane of the mouth is of interest for speech production research. In this work, we focus on estimating utterance-level rtMRI video from the spoken phoneme sequence. We use forced alignment to obtain time-aligned phonemes, from which we derive frame-level phoneme sequences aligned with the rtMRI frames. We propose a sequence-to-sequence learning model with a transformer phoneme encoder and a convolutional frame decoder. We then modify the learning by using intermediate features sampled from a pretrained phoneme-conditioned variational autoencoder (CVAE). We train subject-specific models for 8 subjects and evaluate them with a subjective test. We also use an auxiliary air-tissue boundary (ATB) segmentation task to obtain objective scores for the proposed models. We show that the proposed method generates realistic rtMRI video for unseen utterances, and that adding the CVAE benefits learning of the sequence-to-sequence mapping for subjects where the mapping is hard to learn.
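The step from time-aligned phonemes to frame-level phoneme sequences can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `phonemes_to_frames`, the rule of labelling each frame by the phoneme that covers its centre time, the `sil` fill label, and the 20 frames-per-second rate in the usage example are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import List, Tuple

# One forced-alignment interval: (start_sec, end_sec, phoneme label).
Interval = Tuple[float, float, str]


def phonemes_to_frames(
    intervals: List[Interval],
    num_frames: int,
    fps: float,
    fill: str = "sil",  # assumed label for frames not covered by any interval
) -> List[str]:
    """Assign one phoneme label to each video frame.

    Assumption: frame i is labelled with the phoneme whose interval
    [start, end) contains the frame's centre time (i + 0.5) / fps.
    """
    labels = []
    for i in range(num_frames):
        centre = (i + 0.5) / fps
        label = fill
        for start, end, phone in intervals:
            if start <= centre < end:
                label = phone
                break
        labels.append(label)
    return labels


# Hypothetical alignment for a short utterance and a hypothetical frame rate.
alignment = [(0.00, 0.10, "sil"), (0.10, 0.25, "DH"), (0.25, 0.40, "AH")]
frame_labels = phonemes_to_frames(alignment, num_frames=8, fps=20.0)
print(frame_labels)
# ['sil', 'sil', 'DH', 'DH', 'DH', 'AH', 'AH', 'AH']
```

The resulting list has one label per rtMRI frame, so it can serve as the input sequence for a phoneme encoder whose output length matches the number of video frames.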