With the recent surge in the use of video conferencing tools, providing high-quality speech signals and accurate captions has become essential for conducting day-to-day business and connecting with friends and family. In these scenarios, single-channel personalized speech enhancement (PSE) methods show promising results compared with unconditional speech enhancement (SE) methods because they can remove interfering speech in addition to environmental noise. In this work, we leverage the spatial information afforded by microphone arrays to further improve the performance of such systems. We investigate the relative importance of speaker embeddings and spatial features. Moreover, we propose a new causal, array-geometry-agnostic multi-channel PSE model that can generate a high-quality enhanced signal from an arbitrary microphone geometry. Experimental results show that the proposed geometry-agnostic model outperforms a model trained on a specific microphone array geometry in both speech quality and automatic speech recognition accuracy. We also demonstrate the effectiveness of the proposed approach on unseen array geometries.