High-quality speech capture has been widely studied for both voice communication and human-computer interfaces. To improve capture performance, multi-microphone speech enhancement techniques are often deployed on various devices. The multi-microphone speech enhancement problem is often decomposed into two decoupled steps: a beamformer that provides spatial filtering, and a single-channel speech enhancement model that cleans up the beamformer output. In this work, we propose a speech enhancement solution that takes both the raw microphone signal and the beamformer output as inputs to an ML model. We devise a simple yet effective training scheme that lets the model learn from the cues of the beamformer by contrasting the two inputs, greatly boosting its capability in spatial rejection while it also performs the general tasks of denoising and dereverberation. The proposed solution takes advantage of classical spatial filtering algorithms instead of competing with them. By design, the beamformer module can be selected separately and does not require a large amount of data to be optimized for a given form factor, while the network model can be treated as a standalone module that is highly transferable, independently of the microphone array. We name the ML module in our solution GSENet, short for Guided Speech Enhancement Network. We demonstrate its effectiveness on real-world data collected on multi-microphone devices in terms of the suppression of noise and interfering speech.
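The two-input idea can be illustrated with a minimal sketch. The abstract does not specify which beamformer is used or how the two signals are presented to the network, so everything below is an assumption made for illustration: a plain delay-and-sum beamformer with known integer sample delays stands in for the beamformer module, and the raw reference microphone and the beamformer output are simply stacked as two channels. The names `delay_and_sum` and `build_model_input` are hypothetical and are not taken from the paper.

```python
import numpy as np


def delay_and_sum(mics: np.ndarray, delays: np.ndarray) -> np.ndarray:
    """Illustrative delay-and-sum beamformer (an assumed stand-in, not the paper's).

    mics:   array of shape (num_mics, num_samples) with raw microphone signals
    delays: non-negative integer delays, in samples, by which the target
            signal lags on each microphone
    """
    num_mics, num_samples = mics.shape
    out = np.zeros(num_samples)
    for channel, d in zip(mics, delays):
        d = int(d)
        # Advance the channel by d samples to undo its arrival delay;
        # the last d samples are left as zeros.
        aligned = np.zeros(num_samples)
        aligned[: num_samples - d] = channel[d:]
        out += aligned
    # Averaging keeps the aligned target at its original amplitude.
    return out / num_mics


def build_model_input(mics: np.ndarray, delays: np.ndarray, ref_mic: int = 0) -> np.ndarray:
    """Stack the raw reference microphone and the beamformer output.

    Returns an array of shape (2, num_samples). The two-channel stacking is
    an assumption about how a network could receive both signals.
    """
    beam = delay_and_sum(mics, delays)
    return np.stack([mics[ref_mic], beam], axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.standard_normal(16000)
    # Second microphone hears the same target 3 samples later.
    delayed = np.concatenate([np.zeros(3), target[:-3]])
    mics = np.stack([target, delayed])
    features = build_model_input(mics, np.array([0, 3]))
    print(features.shape)
```

In this sketch the beamformer is an ordinary function that can be swapped for any other spatial filter without touching the downstream model input format, which mirrors the modularity the abstract describes.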