Previous studies have confirmed the effectiveness of leveraging articulatory information to improve speech enhancement (SE) performance. By augmenting the original acoustic features with place/manner-of-articulation features, the SE process can be guided to consider the articulatory properties of the input speech during enhancement. We therefore believe that the contextual information of articulatory attributes carries useful cues and can further benefit SE in different languages. In this study, we propose an SE system that improves its performance by optimizing the contextual articulatory information in the enhanced speech for both English and Mandarin. We optimize this information by jointly training the SE model with an end-to-end automatic speech recognition (E2E ASR) model that predicts sequences of broad phone classes (BPCs) instead of word sequences. In addition, two training strategies are developed to train the SE system with the BPC-based ASR: a multitask-learning strategy and a deep-feature training strategy. Experimental results on the TIMIT and TMHINT datasets confirm that the contextual articulatory information enables the SE system to achieve better results than a traditional acoustic model (AM). Moreover, compared with an SE system trained with a monophone-based ASR, the BPC-based ASR (which provides contextual articulatory information) improves SE performance more effectively under different signal-to-noise ratios (SNRs).
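As a minimal sketch of the multitask-learning strategy mentioned above, the joint objective can be viewed as a weighted sum of an SE reconstruction loss and a BPC-sequence recognition loss from the ASR branch. The module shapes, the choice of L1 reconstruction and CTC losses, and the weight `alpha` are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class JointSEBPCLoss(nn.Module):
    """Multitask loss: SE reconstruction + BPC-based ASR (CTC) term.

    Hypothetical sketch; `alpha` balances the two objectives and
    `num_bpc` is the number of broad phone classes (plus a CTC blank).
    """

    def __init__(self, alpha=0.1):
        super().__init__()
        self.alpha = alpha          # weight on the BPC-ASR loss (assumed value)
        self.se_loss = nn.L1Loss()  # enhancement (reconstruction) loss
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)

    def forward(self, enhanced, clean, bpc_log_probs, bpc_targets,
                input_lengths, target_lengths):
        # Reconstruction term: enhanced speech vs. clean reference
        l_se = self.se_loss(enhanced, clean)
        # Recognition term: BPC sequence predicted by the E2E ASR branch
        l_asr = self.ctc(bpc_log_probs, bpc_targets,
                         input_lengths, target_lengths)
        return l_se + self.alpha * l_asr
```

In the deep-feature variant, the ASR branch would instead supply intermediate representations to match, rather than a CTC term; the weighted-sum form above corresponds only to the multitask-learning strategy.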