Personalised speech enhancement (PSE), which extracts only the speech of a target user and removes everything else from a recorded audio clip, can potentially improve users' experiences of audio AI modules deployed in the wild. To support a large variety of downstream audio tasks, such as real-time ASR and audio-call enhancement, a PSE solution should operate in a streaming mode, i.e., input audio cleaning should happen in real time with a small latency and real-time factor. Personalisation is typically achieved by extracting a target speaker's voice profile from enrolment audio, in the form of a static embedding vector, and then using it to condition the output of a PSE model. However, a fixed target speaker embedding may not be optimal under all conditions. In this work, we present a streaming Transformer-based PSE model and propose a novel cross-attention approach that produces adaptive target speaker representations. We present extensive experiments and show that our proposed cross-attention approach consistently outperforms competitive baselines, even when our model is only approximately half their size.
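To illustrate the idea of conditioning on an adaptive rather than static speaker representation, the following is a minimal sketch, not the paper's exact architecture: the mixture frames attend over the encoded enrolment frames via cross-attention, so the speaker conditioning can vary per frame. All names and dimensions (EnrolmentCrossAttention, d_model, n_heads) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EnrolmentCrossAttention(nn.Module):
    """Attend from the noisy-mixture frames (queries) over the enrolment
    frames (keys/values), yielding a speaker representation that adapts
    per frame instead of a single static embedding vector."""
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, mixture_feats, enrol_feats):
        # mixture_feats: (batch, T_mix, d_model)   -- streaming PSE encoder states
        # enrol_feats:   (batch, T_enrol, d_model) -- encoded enrolment audio
        adaptive_spk, _ = self.attn(query=mixture_feats,
                                    key=enrol_feats,
                                    value=enrol_feats)
        # Condition the PSE model on the adaptive representation,
        # here simply added to the mixture features.
        return mixture_feats + adaptive_spk


if __name__ == "__main__":
    model = EnrolmentCrossAttention()
    mix = torch.randn(2, 100, 256)    # 100 mixture frames
    enrol = torch.randn(2, 300, 256)  # 300 enrolment frames
    out = model(mix, enrol)
    print(out.shape)  # torch.Size([2, 100, 256])
```

In a static-embedding baseline, enrol_feats would instead be mean-pooled into one vector and broadcast to every frame; the cross-attention variant lets the conditioning weight different parts of the enrolment audio depending on the current mixture content.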