Non-autoregressive (NAR) models have achieved a large reduction in inference computation while obtaining results comparable to autoregressive (AR) models on various sequence-to-sequence tasks. However, there has been limited research exploring NAR approaches for sequence-to-multi-sequence problems, such as multi-speaker automatic speech recognition (ASR). In this study, we extend our previously proposed conditional chain model to NAR multi-speaker ASR. Specifically, the output of each speaker is inferred one by one using both the input mixture speech and previously estimated conditional speaker features. In each step, a NAR connectionist temporal classification (CTC) encoder is used to perform parallel computation. With this design, the total number of inference steps is restricted to the number of mixed speakers. In addition, we adopt the Conformer and incorporate an intermediate CTC loss to improve performance. Experiments on the WSJ0-Mix and LibriMix corpora show that our model outperforms other NAR models with only a slight increase in latency, achieving WERs of 22.3% and 24.9%, respectively. Moreover, by including data with variable numbers of speakers, our model can even outperform the PIT-Conformer AR model with only 1/7 of the latency, obtaining WERs of 19.9% and 34.3% on the WSJ0-2mix and WSJ0-3mix sets, respectively. All of our code is publicly available at https://github.com/pengchengguo/espnet/tree/conditional-multispk.
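The speaker-by-speaker inference described in the abstract can be illustrated with a minimal Python sketch. It is not the released ESPnet implementation: `step_fn`, `toy_step`, the `BLANK` index, and the toy score matrices are hypothetical stand-ins, greedy CTC decoding is assumed as the per-step decoder, and the number of speakers is assumed to be known in advance.

```python
from typing import Callable, List, Optional, Sequence, Tuple

BLANK = 0  # assumed index of the CTC blank symbol

Frames = Sequence[Sequence[float]]  # per-frame scores over the vocabulary
Condition = Optional[List[int]]     # stand-in for conditional speaker features
StepFn = Callable[[object, Condition], Tuple[Frames, Condition]]


def ctc_greedy_decode(frames: Frames, blank: int = BLANK) -> List[int]:
    """Take the best label per frame, merge repeats, then drop blanks."""
    tokens: List[int] = []
    prev = blank
    for scores in frames:
        best = max(range(len(scores)), key=lambda i: scores[i])
        if best != prev and best != blank:
            tokens.append(best)
        prev = best
    return tokens


def conditional_chain_decode(
    mixture: object, step_fn: StepFn, num_speakers: int
) -> List[List[int]]:
    """Run one parallel CTC pass per speaker, chaining the condition forward."""
    condition: Condition = None  # no previous speaker at the first step
    outputs: List[List[int]] = []
    for _ in range(num_speakers):
        # Hypothetical encoder call: sees the mixture plus the previous condition.
        frames, condition = step_fn(mixture, condition)
        outputs.append(ctc_greedy_decode(frames))
    return outputs


def toy_step(mixture, condition):
    """Hypothetical encoder stub: returns precomputed scores for the next speaker."""
    idx = 0 if condition is None else condition[0] + 1
    return mixture[idx], [idx]


# Toy "mixture": precomputed per-frame scores for two speakers (made-up values).
toy_mixture = [
    [[0.1, 0.8, 0.1], [0.1, 0.8, 0.1], [0.9, 0.05, 0.05], [0.1, 0.1, 0.8]],
    [[0.1, 0.1, 0.8], [0.9, 0.05, 0.05], [0.1, 0.1, 0.8]],
]
hypotheses = conditional_chain_decode(toy_mixture, toy_step, num_speakers=2)
print(hypotheses)
```

The loop runs once per speaker, which is why the number of inference steps depends on the number of mixed speakers rather than on the output length. Within each step, every frame is decoded independently of the others.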