Human-robot interaction relies on a noise-robust audio processing module capable of estimating target speech from audio recordings impacted by environmental noise as well as self-induced noise, the so-called ego-noise. While external ambient noise sources vary from environment to environment, ego-noise is mainly caused by the internal motors and joints of a robot. Ego-noise and environmental noise reduction are often decoupled, i.e., ego-noise reduction is performed without considering environmental noise. Recently, a variational autoencoder (VAE)-based speech model has been combined with a fully adaptive non-negative matrix factorization (NMF) noise model to recover clean speech under diverse environmental noise disturbances. However, its enhancement performance is limited in adverse acoustic scenarios involving, e.g., ego-noise. In this paper, we propose a multichannel partially adaptive scheme to jointly model ego-noise and environmental noise within the VAE-NMF framework, where we exploit the spatially and spectrally structured characteristics of ego-noise by pre-training the ego-noise model, while retaining the ability to adapt to unknown environmental noise. Experimental results show that our proposed approach outperforms methods based on a completely fixed scheme and on a fully adaptive scheme when ego-noise and environmental noise are present simultaneously.
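To make the partially adaptive idea concrete, the following is a minimal single-channel sketch of an NMF noise model whose dictionary concatenates pre-trained, fixed ego-noise bases with adaptive environmental-noise bases; only the latter, together with all activations, are updated. All variable names, dimensions, and the Itakura-Saito multiplicative updates are illustrative assumptions, not the paper's exact multichannel VAE-NMF algorithm.

```python
import numpy as np

# Partially adaptive NMF sketch: the noise power spectrogram V (freq x time)
# is approximated as W @ H with W = [W_ego, W_env]. W_ego is pre-trained on
# ego-noise recordings and kept fixed; W_env and the activations H adapt to
# the observed signal. All values below are stand-ins for illustration.

rng = np.random.default_rng(0)
F, T = 257, 100          # frequency bins, time frames (assumed)
K_ego, K_env = 16, 8     # dictionary sizes for ego / environmental noise (assumed)

V = rng.random((F, T)) + 1e-6            # stand-in for an observed noise spectrogram
W_ego = rng.random((F, K_ego)) + 1e-6    # stand-in for pre-trained ego-noise bases
W_env = rng.random((F, K_env)) + 1e-6    # adaptive environmental-noise bases
H = rng.random((K_ego + K_env, T)) + 1e-6

eps = 1e-10
for _ in range(100):
    W = np.concatenate([W_ego, W_env], axis=1)
    V_hat = W @ H + eps
    # Multiplicative update for H, minimizing the Itakura-Saito divergence
    # (a common choice for power spectrograms).
    H *= (W.T @ (V / V_hat**2)) / (W.T @ (1.0 / V_hat) + eps)
    V_hat = W @ H + eps
    H_env = H[K_ego:]
    # Only the environmental bases are updated; W_ego stays fixed,
    # which is what makes the scheme "partially adaptive".
    W_env *= ((V / V_hat**2) @ H_env.T) / ((1.0 / V_hat) @ H_env.T + eps)
```

Keeping W_ego fixed encodes the known spectral structure of ego-noise, while W_env absorbs the unknown ambient noise; in the paper's multichannel setting, spatial structure is exploited in an analogous way by pre-training the ego-noise model.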