Silent Speech Interfaces aim to reconstruct the acoustic signal from a sequence of ultrasound tongue images that records the articulatory movement. Extracting information about the tongue movement requires us to efficiently process the whole sequence of images, not just a single image. Several approaches have been proposed to process such sequential image data. The classic neural network structure combines two-dimensional convolutional (2D-CNN) layers that process the images separately with recurrent layers (e.g., an LSTM) on top of them to fuse the information along time. More recently, it was shown that one may also apply a 3D-CNN network that extracts information along both the spatial and the temporal axes in parallel, achieving a similar accuracy while being less time-consuming. A third option is to apply the less well-known ConvLSTM layer type, which combines the advantages of LSTM and CNN layers by replacing matrix multiplication with the convolution operation. In this paper, we experimentally compared various combinations of the above-mentioned layer types for a silent speech interface task, and we obtained the best result with a hybrid model that consists of a combination of 3D-CNN and ConvLSTM layers. This hybrid network is slightly faster, smaller and more accurate than our previous 3D-CNN model.
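To illustrate the hybrid architecture described above, here is a minimal Keras sketch that stacks a 3D-CNN layer (parallel spatio-temporal feature extraction) with a ConvLSTM layer (temporal fusion that preserves spatial structure). All dimensions (frame count, image size, output feature size) and the single-layer-per-type depth are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# Assumed input: 8 consecutive 64x128 single-channel ultrasound tongue images.
T, H, W = 8, 64, 128

model = models.Sequential([
    layers.Input(shape=(T, H, W, 1)),
    # 3D convolution extracts features along the spatial and temporal axes in parallel.
    layers.Conv3D(16, kernel_size=(3, 3, 3), padding='same', activation='relu'),
    layers.MaxPooling3D(pool_size=(1, 2, 2)),
    # ConvLSTM2D fuses information along time, replacing the LSTM's matrix
    # multiplications with convolutions so spatial structure is preserved.
    layers.ConvLSTM2D(32, kernel_size=(3, 3), padding='same',
                      return_sequences=False),
    layers.Flatten(),
    # Regression head mapping to a hypothetical 25-dim acoustic feature vector.
    layers.Dense(25, activation='linear'),
])

# Forward pass on a random batch of 2 image sequences.
x = np.random.rand(2, T, H, W, 1).astype('float32')
y = model(x)
print(tuple(y.shape))
```

In practice such a model would be trained with a regression loss (e.g., mean squared error) against spectral features of the recorded speech, from which a vocoder reconstructs the waveform.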