Given a multi-microphone recording of an unknown number of speakers talking concurrently, we simultaneously localize the sources and separate the individual speakers. At the core of our method is a deep network operating in the waveform domain that isolates sources within an angular region $\theta \pm w/2$, given an angle of interest $\theta$ and an angular window size $w$. By exponentially decreasing $w$, we can perform a binary search to localize and separate all sources in logarithmic time. Our algorithm allows for an arbitrary number of potentially moving speakers at test time, including more speakers than seen during training. Experiments demonstrate state-of-the-art performance for both source separation and source localization, particularly under high levels of background noise.
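The binary search over angular windows can be illustrated with a minimal sketch. It is not the paper's implementation: the deep network is replaced by a hypothetical toy oracle, `make_toy_detector`, that simply reports whether any known source angle falls inside $\theta \pm w/2$. All names, the starting window of 360 degrees, and the stopping width `min_w` are assumptions made for illustration. Angles are assumed to lie in $[-180, 180)$ degrees.

```python
from typing import Callable, List

Detector = Callable[[float, float], bool]


def make_toy_detector(true_angles: List[float]) -> Detector:
    """Hypothetical stand-in for the separation network.

    Returns True if any source lies in the half-open window
    [theta - w/2, theta + w/2). A real system would instead run the
    network and decide from its output (an assumption, not shown here).
    """
    def detector(theta: float, w: float) -> bool:
        lo, hi = theta - w / 2, theta + w / 2
        return any(lo <= a < hi for a in true_angles)
    return detector


def localize(detector: Detector, theta: float = 0.0,
             w: float = 360.0, min_w: float = 2.0) -> List[float]:
    """Recursively halve the window, pruning regions with no source."""
    if not detector(theta, w):
        return []          # empty region: discard the whole subtree
    if w <= min_w:
        return [theta]     # window is narrow enough: report its center
    half = w / 2
    # Two child windows of width w/2 that exactly tile the parent.
    left = localize(detector, theta - half / 2, half, min_w)
    right = localize(detector, theta + half / 2, half, min_w)
    return left + right


true_angles = [-100.0, 30.0, 125.0]
estimates = localize(make_toy_detector(true_angles))
print(estimates)
```

Because empty regions are pruned at every level, the number of detector queries in this sketch grows with the number of sources times the depth of the search, which is logarithmic in the ratio of the starting window to `min_w`. Each returned estimate is the center of a final window, so it lies within half that window's width of a true source angle.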