This work introduces Cleanformer, a streaming, multichannel neural enhancement frontend for automatic speech recognition (ASR). The model has a Conformer-based architecture that takes as input a single channel each of the raw and enhanced signals and uses self-attention to derive a time-frequency mask. The enhanced input is generated by a multichannel adaptive noise-cancellation algorithm known as Speech Cleaner, which uses noise context to derive its filter taps. The time-frequency mask is applied to the noisy input to produce enhanced output features for ASR. Detailed evaluations are presented on simulated and re-recorded datasets with both speech and non-speech noise, showing significant reductions in word error rate (WER) when using a large-scale state-of-the-art ASR model. Cleanformer is also shown to significantly outperform enhancement using a beamformer with ideal steering. The enhancement model is agnostic to the number of microphones and the array configuration and can therefore be used with different microphone arrays without retraining. Performance improves with more microphones, up to four, with each additional microphone providing a smaller marginal benefit. Specifically, at an SNR of -6 dB, relative WER improvements of about 80\% are shown in both noise conditions.
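The masking step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the spectrogram shapes, the random placeholder mask (standing in for the model's self-attention-derived mask), and the log-magnitude feature choice are all assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed shapes: T time frames x F frequency bins.
T, F = 100, 257
noisy_spec = rng.standard_normal((T, F)) + 1j * rng.standard_normal((T, F))

# Placeholder for the predicted time-frequency mask: one value in
# [0, 1] per T-F bin (here a sigmoid of random logits, purely illustrative).
mask = 1.0 / (1.0 + np.exp(-rng.standard_normal((T, F))))

# Element-wise masking of the noisy input yields the enhanced
# spectrogram; log-magnitude features of this would feed the ASR model.
enhanced_spec = mask * noisy_spec
log_features = np.log(np.abs(enhanced_spec) + 1e-8)

print(log_features.shape)
```

Because the mask is predicted per time-frequency bin rather than per channel, this step is independent of the microphone count, consistent with the array-agnostic property described above.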