Phoneme recognition is an important part of speech recognition that requires extracting phonetic features from multiple frames. In this paper, we compare and analyze CNN, RNN, Transformer, and Conformer models on phoneme recognition. For the CNN, we use the ContextNet model in our experiments. First, we compare the accuracy of these architectures under different constraints, such as receptive field length, parameter size, and layer depth. Second, we interpret the performance differences among the models, especially as the observable sequence length varies. Our analyses show that the Transformer and Conformer models benefit from the long-range accessibility of self-attention across input frames.
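As a minimal point of reference for the receptive-field contrast mentioned above (an illustrative formula, not taken from the paper, and ignoring ContextNet-specific components such as downsampling and squeeze-and-excitation): a stack of L one-dimensional convolution layers with kernel size k and stride 1 has a receptive field of R_conv(L, k) = 1 + L(k - 1) frames, which grows only linearly with depth, whereas a single self-attention layer can attend to all T input frames, i.e. R_attn = T regardless of depth.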