Ultrasound tongue imaging is widely used in speech production research, and it has attracted increasing attention because of its potential applications in many fields, such as visual biofeedback tools for second language acquisition and silent speech interfaces. Unlike previous studies, we explore the feasibility of estimating a speaker's age from ultrasound tongue images. Motivated by the success of deep learning, we apply it to this task and train a deep convolutional neural network model on the UltraSuite dataset. The model achieves a mean absolute error (MAE) of 2.03 on data from typically developing children, whereas the MAE is 4.87 on data from children with speech sound disorders, which suggests that age estimation from ultrasound is more challenging for children with speech sound disorders. The developed method can be used as a tool to evaluate the performance of speech therapy sessions. It is also worth noting that, although we use ultrasound tongue imaging in this study, the proposed methods may be extended to other imaging modalities (e.g. MRI) to assist studies of speech production.
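For readers unfamiliar with the metric, MAE is the average of the absolute differences between predicted and true ages. The following is a minimal sketch of that computation only; the function name and the age values are hypothetical illustrations, not the authors' code and not data from UltraSuite.

```python
def mean_absolute_error(true_ages, predicted_ages):
    """Average absolute difference between true and predicted ages."""
    if len(true_ages) != len(predicted_ages):
        raise ValueError("both sequences must have the same length")
    if not true_ages:
        raise ValueError("at least one sample is required")
    total = sum(abs(t - p) for t, p in zip(true_ages, predicted_ages))
    return total / len(true_ages)


# Hypothetical ages for four speakers (illustrative values only).
true_ages = [6.0, 8.0, 10.0, 12.0]
predicted_ages = [7.0, 7.5, 12.0, 11.5]

# Absolute errors are 1.0, 0.5, 2.0 and 0.5, so the mean is 1.0.
print(mean_absolute_error(true_ages, predicted_ages))  # prints 1.0
```

Under this definition, a lower value means the predicted ages lie closer to the true ages on average, which is how the two MAE figures above are compared.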