Recent work has shown that it is possible to train a single model to perform joint acoustic echo cancellation (AEC), speech enhancement, and voice separation, thereby serving as a unified frontend for robust automatic speech recognition (ASR). The joint model uses contextual information, such as a reference of the playback audio, a noise context, and a speaker embedding. In this work, we propose several novel improvements to such a model. First, we improve the architecture of the Cross-Attention Conformer used to ingest noise context into the model. Second, we generalize the model to handle varying lengths of noise context. Third, we propose Signal Dropout, a novel strategy for modeling missing contextual information. In the absence of one or more signals, the proposed model performs nearly as well as task-specific models trained without those signals; when such signals are present, it compares favorably against systems that require all context signals. Compared to the baseline, the final model retains a relative word error rate reduction of 25.0% on background speech when the speaker embedding is absent, and 61.2% on AEC when device playback is absent.
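The abstract does not spell out the mechanics of Signal Dropout. As a rough illustration only, the sketch below assumes it amounts to independently masking each optional context signal (playback reference, noise context, speaker embedding) with some probability during training, so the model learns to cope when any subset of signals is missing at inference time. The names `ContextSignals`, `signal_dropout`, and `drop_prob` are hypothetical and not taken from the paper.

```python
import random
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class ContextSignals:
    """Optional context inputs to the hypothetical joint frontend."""
    playback_reference: Optional[np.ndarray]  # device playback signal used for AEC
    noise_context: Optional[np.ndarray]       # preceding noise-only audio segment
    speaker_embedding: Optional[np.ndarray]   # embedding of the target speaker


def signal_dropout(ctx: ContextSignals, drop_prob: float = 0.3) -> ContextSignals:
    """Independently drop each context signal with probability `drop_prob`.

    A dropped signal is replaced by None (equivalently, an all-zero input plus a
    "missing" flag), so training covers missing-context conditions and the model
    degrades gracefully when a signal is absent at inference time.
    """
    def maybe_drop(x: Optional[np.ndarray]) -> Optional[np.ndarray]:
        if x is None or random.random() < drop_prob:
            return None
        return x

    return ContextSignals(
        playback_reference=maybe_drop(ctx.playback_reference),
        noise_context=maybe_drop(ctx.noise_context),
        speaker_embedding=maybe_drop(ctx.speaker_embedding),
    )
```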