Articulatory-to-acoustic mapping seeks to reconstruct speech from a recording of the articulatory movements, for example, an ultrasound video of the tongue. Just like speech signals, these recordings not only capture the linguistic content but are also highly specific to the actual speaker. Hence, due to the lack of multi-speaker data sets, researchers have so far concentrated on speaker-dependent modeling. Here, we present multi-speaker experiments using the recently published TaL80 corpus. To model speaker characteristics, we adapted the x-vector framework, popular in speech processing, to operate with ultrasound tongue videos. Next, we performed speaker recognition experiments using 50 speakers from the corpus. Then, we created speaker embedding vectors and evaluated them on the remaining speakers. Finally, we examined how the embedding vector influences the accuracy of our ultrasound-to-speech conversion network in a multi-speaker scenario. In the experiments we attained speaker recognition error rates below 3%, and we also found that the embedding vectors generalize well to unseen speakers. Our first attempt to apply them in a multi-speaker silent speech framework yielded a marginal reduction in the error rate of the spectral estimation step.
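To make the adaptation of the x-vector framework to ultrasound tongue videos more concrete, the following is a minimal sketch in PyTorch. It follows the standard x-vector recipe (frame-level layers, statistics pooling over time, a segment-level embedding layer, and a speaker-classification output), but the flattened per-frame input, all layer sizes, and the dilated 1-D convolutions standing in for the TDNN layers are illustrative assumptions, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

class UltrasoundXVector(nn.Module):
    """Minimal x-vector-style network over ultrasound tongue video.

    Frame-level layers operate on per-frame features, statistics
    pooling aggregates over time, and the first segment-level affine
    layer provides the speaker embedding (the "x-vector"). All sizes
    here are illustrative assumptions, not the paper's values.
    """

    def __init__(self, frame_dim=64 * 128, emb_dim=512, n_speakers=50):
        super().__init__()
        # Frame-level subnetwork; the TDNN of the original x-vector
        # recipe is approximated with dilated 1-D convolutions.
        self.frame_net = nn.Sequential(
            nn.Conv1d(frame_dim, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),
        )
        # Segment-level layers; the embedding is taken before the softmax.
        self.embedding = nn.Linear(2 * 1500, emb_dim)
        self.classifier = nn.Linear(emb_dim, n_speakers)

    def forward(self, video):
        # video: (batch, time, height, width) ultrasound frames;
        # height * width must equal frame_dim.
        b, t, h, w = video.shape
        x = video.reshape(b, t, h * w).transpose(1, 2)  # (b, feat, time)
        x = self.frame_net(x)
        # Statistics pooling: mean and std over the time axis.
        stats = torch.cat([x.mean(dim=2), x.std(dim=2)], dim=1)
        emb = self.embedding(stats)  # speaker embedding (x-vector)
        return self.classifier(emb), emb
```

In this sketch, the classifier output would be trained on the 50 speakers used for the speaker recognition experiments, while the `emb` output is the embedding vector that, as the abstract describes, is fed to the ultrasound-to-speech conversion network in the multi-speaker scenario.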