This work introduces the \emph{cross-attention conformer}, an attention-based architecture for context modeling in speech enhancement. Because the context information can itself be sequential, and of a different length than the audio to be enhanced, we use cross-attention to summarize contextual information and merge it with the input features. Building on the recently proposed conformer model, which uses self-attention layers as building blocks, the proposed cross-attention conformer can be used to build deep contextual models. As a concrete example, we show how noise context, i.e., a short noise-only audio segment preceding an utterance, can be used to build a speech enhancement feature frontend from cross-attention conformer layers that improves the noise robustness of automatic speech recognition.
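For concreteness, one standard way to write the cross-attention operation that merges context with input features is the scaled dot-product form in which queries come from the input and keys and values come from the context (the notation below is ours for illustration; the exact parametrization inside the cross-attention conformer layer may differ):
\[
\mathrm{CrossAttention}(X, C) = \mathrm{softmax}\!\left(\frac{(X W_Q)(C W_K)^\top}{\sqrt{d_k}}\right) C W_V,
\]
where $X \in \mathbb{R}^{T \times d}$ denotes the input (noisy speech) features, $C \in \mathbb{R}^{T' \times d}$ the context features (e.g., the noise context), and $W_Q$, $W_K$, $W_V$ learned projection matrices. Because attention weights are computed between all $T$ input frames and all $T'$ context frames, the two sequences need not have the same length.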