Silent speech interfaces (SSI) aim to reconstruct the speech signal from a recording of articulatory movement, such as an ultrasound video of the tongue. Currently, deep neural networks are the most successful technology for this task. An efficient solution requires methods that do not simply process single images, but can extract the tongue movement information from a sequence of video frames. One option is to apply recurrent neural structures such as the long short-term memory network (LSTM) in combination with 2D convolutional neural networks (CNNs). Here, we experiment with another approach that extends the CNN to perform 3D convolution, where the extra dimension corresponds to time. In particular, we apply the spatial and temporal convolutions in a decomposed form, which has recently proved very successful in video action recognition. We find experimentally that our 3D network outperforms the CNN+LSTM model, indicating that 3D CNNs may be a feasible alternative to CNN+LSTM networks in SSI systems.
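To make the decomposed spatial-temporal convolution concrete, the following is a minimal PyTorch sketch of a (2+1)D block of the kind popularized in video action recognition: one full 3D convolution is factored into a 2D spatial convolution per frame followed by a 1D convolution along the time axis. The channel counts, kernel sizes, and input dimensions here are illustrative assumptions, not the architecture used in the paper.

```python
import torch
import torch.nn as nn

class Decomposed3DConv(nn.Module):
    """A (2+1)D convolution block: a spatial convolution over each frame,
    then a temporal convolution across frames, replacing one full 3D
    convolution. Layer sizes below are hypothetical, for illustration."""

    def __init__(self, in_ch, out_ch, mid_ch, spatial_k=3, temporal_k=3):
        super().__init__()
        # Spatial step: kernel (1, k, k) touches only height and width.
        self.spatial = nn.Conv3d(
            in_ch, mid_ch,
            kernel_size=(1, spatial_k, spatial_k),
            padding=(0, spatial_k // 2, spatial_k // 2))
        # Temporal step: kernel (k, 1, 1) touches only the time axis.
        self.temporal = nn.Conv3d(
            mid_ch, out_ch,
            kernel_size=(temporal_k, 1, 1),
            padding=(temporal_k // 2, 0, 0))
        self.relu = nn.ReLU()

    def forward(self, x):
        # x has shape (batch, channels, time, height, width)
        return self.relu(self.temporal(self.relu(self.spatial(x))))

# Hypothetical input: 8 grayscale ultrasound frames of 64x128 pixels.
x = torch.randn(1, 1, 8, 64, 128)
block = Decomposed3DConv(in_ch=1, out_ch=16, mid_ch=8)
print(block(x).shape)  # torch.Size([1, 16, 8, 64, 128])
```

The extra nonlinearity between the spatial and temporal steps is one commonly cited benefit of the decomposition: it doubles the number of activations compared to a single 3D convolution with a comparable parameter budget.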