When multiple people talk simultaneously, the spatial characteristics of the signals are the most distinctive feature for extracting the target signal. In this work, we develop a deep joint spatial-spectral non-linear filter that can be steered toward an arbitrary target direction. To this end, we propose a simple and effective conditioning mechanism that sets the initial state of the filter's recurrent layers based on the target direction. We show that this scheme is more effective than the baseline approach and increases the flexibility of the filter at no performance cost. As we demonstrate in this paper, the resulting spatially selective non-linear filters can also be used for speech separation of an arbitrary number of speakers and enable very accurate multi-speaker localization.
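The conditioning idea described in the abstract, setting the initial state of a recurrent layer from the target direction, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the abstract: the class name `SteerableRNN`, the quantization of directions into `NUM_DIRECTIONS` bins, the plain tanh recurrence (standing in for whatever recurrent layers the actual filter uses), the layer sizes, and the random weights that stand in for trained parameters.

```python
import math
import random

# Hypothetical sizes, chosen only to keep the example small.
NUM_DIRECTIONS = 4   # assumed: target directions quantized into angular bins
INPUT_SIZE = 2       # assumed: features per input frame
HIDDEN_SIZE = 3      # assumed: size of the recurrent state


def make_matrix(rows, cols, rng):
    """Random weights standing in for trained parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class SteerableRNN:
    """Toy recurrent layer whose initial state depends on the target direction."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.w_in = make_matrix(HIDDEN_SIZE, INPUT_SIZE, rng)
        self.w_rec = make_matrix(HIDDEN_SIZE, HIDDEN_SIZE, rng)
        # One initial-state vector per direction bin. Looking up a row is
        # equivalent to applying a linear layer to a one-hot direction code.
        self.w_dir = make_matrix(NUM_DIRECTIONS, HIDDEN_SIZE, rng)

    def initial_state(self, direction_index):
        """Conditioning step: the direction selects the starting state."""
        return list(self.w_dir[direction_index])

    def forward(self, frames, direction_index):
        """Run the recurrence over all frames, starting from the steered state."""
        state = self.initial_state(direction_index)
        outputs = []
        for frame in frames:
            drive = matvec(self.w_in, frame)
            memory = matvec(self.w_rec, state)
            state = [math.tanh(d + m) for d, m in zip(drive, memory)]
            outputs.append(state)
        return outputs


model = SteerableRNN()
frames = [[0.1, -0.2], [0.3, 0.0], [-0.1, 0.4]]
out_a = model.forward(frames, direction_index=0)
out_b = model.forward(frames, direction_index=2)
```

In this sketch the same weights process the same input frames, and only the initial state changes with the direction index, so `out_a` and `out_b` differ purely because of the conditioning. The direction does not have to be appended to every input frame, which is one way to read the "simple" in the abstract's description; the actual filter architecture and training procedure are not specified in the abstract and are not reproduced here.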