We investigate multi-speaker speech recognition from ultrasound images of the tongue and video images of the lips. We train our systems on imaging data from modal speech, and evaluate on matched test sets of two speaking modes: silent and modal speech. We observe that silent speech recognition from imaging data underperforms compared to modal speech recognition, likely due to a speaking-mode mismatch between training and testing. We improve silent speech recognition performance using techniques that address the domain mismatch, such as fMLLR and unsupervised model adaptation. We also analyse the properties of silent and modal speech in terms of utterance duration and the size of the articulatory space. To estimate the articulatory space, we compute the convex hull of tongue splines extracted from ultrasound tongue images. Overall, we observe that the duration of silent speech is longer than that of modal speech, and that silent speech covers a smaller articulatory space than modal speech. Although the differences in these two properties across speaking modes are statistically significant, they do not directly correlate with word error rates from speech recognition.
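As a concrete illustration of the articulatory-space estimate, the following minimal Python sketch (not the authors' code) pools tongue spline points over an utterance and measures the area of their 2D convex hull; the function name, array shapes, and the synthetic data are assumptions for illustration only.

```python
# Sketch: articulatory space as the convex hull of tongue spline points,
# assuming splines are given as (x, y) coordinate arrays extracted from
# ultrasound tongue images (this pipeline is hypothetical, not the paper's).
import numpy as np
from scipy.spatial import ConvexHull

def articulatory_space_area(spline_points: np.ndarray) -> float:
    """Area of the 2D convex hull covering all tongue spline points.

    spline_points: array of shape (num_points, 2), pooled over all
    frames of an utterance (or of a speaker's whole session).
    """
    hull = ConvexHull(spline_points)
    # For 2D input, ConvexHull.volume is the enclosed area;
    # ConvexHull.area would instead give the hull's perimeter.
    return hull.volume

# Hypothetical usage: pool per-frame splines per speaking mode, then compare.
rng = np.random.default_rng(0)
silent = rng.normal(scale=0.8, size=(5000, 2))  # stand-in for silent-speech splines
modal = rng.normal(scale=1.0, size=(5000, 2))   # stand-in for modal-speech splines
print(articulatory_space_area(silent) < articulatory_space_area(modal))
```

Under the paper's finding, the silent-speech hull would come out smaller than the modal-speech hull, as the synthetic stand-in data above mimics.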