This paper proposes CTC-based non-autoregressive ASR with self-conditioned folded encoders. The proposed method realizes non-autoregressive ASR with fewer parameters by folding the conventional stack of encoders into only two blocks: base encoders and folded encoders. The base encoders convert the input audio features into a neural representation suitable for recognition. The folded encoders are then applied repeatedly to refine this representation further. Applying the CTC loss to the outputs of all encoders enforces consistency of the input-output relationship across iterations, so the folded encoders learn to perform the same operations as a deeper encoder with distinct layers. In experiments, we investigate how to set the number of layers and the number of iterations for the base and folded encoders. The results show that the proposed method achieves performance comparable to that of the conventional method while using only 38% as many parameters. Furthermore, it outperforms the conventional method when the number of iterations is increased.
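The folding idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: `ToyLayer`, `folded_forward`, `count_params`, and `effective_depth` are hypothetical names, each "layer" is a toy scalar map rather than a real encoder layer, and the per-layer parameter count is an arbitrary placeholder. The sketch also assumes that a CTC loss would be attached after the base stack and after every pass through the folded stack; the abstract only states that the loss is applied to the outputs of all encoders.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vector = List[float]


@dataclass
class ToyLayer:
    """Toy stand-in for one encoder layer: y = x + scale * x."""
    scale: float
    n_params: int  # hypothetical parameter count of one layer

    def __call__(self, x: Vector) -> Vector:
        return [v + self.scale * v for v in x]


def folded_forward(
    x: Vector,
    base: List[ToyLayer],
    folded: List[ToyLayer],
    n_iterations: int,
) -> Tuple[Vector, List[Vector]]:
    """Run the base layers once, then reuse the folded layers n_iterations times.

    Returns the final output and the intermediate outputs ("taps") where a
    CTC loss would be attached under the assumption stated above.
    """
    for layer in base:
        x = layer(x)
    taps = [x]  # output of the base stack
    for _ in range(n_iterations):
        for layer in folded:  # the same weights are reused on every pass
            x = layer(x)
        taps.append(x)  # output after each folded pass
    return x, taps


def count_params(base: List[ToyLayer], folded: List[ToyLayer]) -> int:
    """Parameters are counted once per stored layer, regardless of reuse."""
    return sum(l.n_params for l in base) + sum(l.n_params for l in folded)


def effective_depth(n_base: int, n_folded: int, n_iterations: int) -> int:
    """Number of layer applications seen by the input."""
    return n_base + n_folded * n_iterations


if __name__ == "__main__":
    base = [ToyLayer(0.1, 100) for _ in range(2)]
    folded = [ToyLayer(0.1, 100) for _ in range(2)]
    out, taps = folded_forward([1.0, 2.0], base, folded, n_iterations=3)
    print("stored parameters:", count_params(base, folded))
    print("effective depth:", effective_depth(len(base), len(folded), 3))
    print("number of loss taps:", len(taps))
```

In this toy configuration the input passes through eight layer applications while only four layers' worth of parameters are stored, which is the source of the parameter savings; the layer counts here are illustrative and are not the settings evaluated in the paper.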