We present a single-stage causal waveform-to-waveform multichannel model that can separate moving sound sources based on their broad spatial locations in a dynamic acoustic scene. We divide the scene into two spatial regions containing, respectively, the target and the interfering sound sources. The model is trained end-to-end and performs spatial processing implicitly, without any components based on traditional processing or hand-crafted spatial features. We evaluate the proposed model on a real-world dataset and show that it matches the performance of an oracle beamformer followed by a state-of-the-art single-channel enhancement network.
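As a rough illustration of the two-region formulation, the sketch below builds region-based reference signals from individual source waveforms and their azimuth trajectories. It is not the authors' pipeline: the function name `region_targets`, the azimuth-interval definition of the target region, the default bounds, and the per-sample assignment rule are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def region_targets(
    sources: Sequence[Sequence[float]],
    azimuths_deg: Sequence[Sequence[float]],
    lo_deg: float = -45.0,   # assumed lower bound of the target region
    hi_deg: float = 45.0,    # assumed upper bound of the target region
) -> Tuple[List[float], List[float]]:
    """Sum sources into a target-region and an interference-region signal.

    sources[i][t] is sample t of source i; azimuths_deg[i][t] is the
    (hypothetical) azimuth of source i at sample t. A sample contributes
    to the target signal while its source lies inside [lo_deg, hi_deg],
    and to the interference signal otherwise.
    """
    n = len(sources[0])
    target = [0.0] * n
    interference = [0.0] * n
    for wav, az in zip(sources, azimuths_deg):
        for t in range(n):
            if lo_deg <= az[t] <= hi_deg:
                target[t] += wav[t]
            else:
                interference[t] += wav[t]
    return target, interference


# Two sources: one stays in the target region, one moves into it.
sources = [[1.0, 1.0, 1.0], [0.5, 0.5, 0.5]]
azimuths = [[0.0, 10.0, 20.0], [90.0, 60.0, 30.0]]
target, interference = region_targets(sources, azimuths)
```

In this toy setup the second source only contributes to the target signal once its azimuth falls inside the assumed region, which mirrors the idea of separating by broad location rather than by source identity.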