In speech separation, previous studies have usually treated the multi-channel and single-channel scenarios as two separate research tracks, each with its own specialized solutions. Instead, we propose a simple and unified architecture, DasFormer (Deep alternating spectrogram transFormer), that handles both in challenging reverberant environments. Unlike frame-wise sequence modeling, each time-frequency (TF) bin in the spectrogram is assigned an embedding that encodes spectral and spatial information. Given this input, DasFormer is built by repeating a simple block that integrates 1) two multi-head self-attention (MHSA) modules that alternately process the spectrogram within each frequency bin and within each temporal frame, and 2) an MBConv before each MHSA to model local features on the spectrogram. Experiments show that DasFormer models the time-frequency representation effectively: it far exceeds current state-of-the-art (SOTA) models in multi-channel speech separation, and it also achieves single-channel SOTA performance in the more challenging yet realistic reverberant scenario.
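The alternating attention pattern described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the `(T, F, D)` tensor layout, the order of the two attention passes, the residual connections, the random weights, and the helper names `init_params`, `mhsa`, and `alternating_block` are all assumptions made for illustration. The MBConv that precedes each MHSA, as well as any normalization and feed-forward layers, is omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(rng, dim):
    # Hypothetical helper: random query/key/value/output projection matrices.
    return tuple(rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(4))


def mhsa(x, wq, wk, wv, wo, num_heads):
    """Multi-head self-attention over the second-to-last axis of x.

    x has shape (..., L, D); every leading axis is treated as a batch axis,
    so sequences in different batch entries never attend to each other.
    """
    *batch, length, dim = x.shape
    head_dim = dim // num_heads

    def split_heads(t):
        # (..., L, D) -> (..., H, L, head_dim)
        t = t.reshape(*batch, length, num_heads, head_dim)
        return np.swapaxes(t, -2, -3)

    q, k, v = split_heads(x @ wq), split_heads(x @ wk), split_heads(x @ wv)
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(head_dim)
    out = softmax(scores, axis=-1) @ v               # (..., H, L, head_dim)
    out = np.swapaxes(out, -2, -3).reshape(*batch, length, dim)
    return out @ wo


def alternating_block(x, params_frame, params_bin, num_heads):
    """One simplified block over TF-bin embeddings x of shape (T, F, D).

    Assumed order: attention within each temporal frame first (across
    frequency), then attention within each frequency bin (across time).
    """
    # Within each temporal frame: T acts as the batch axis, F is the sequence.
    x = x + mhsa(x, *params_frame, num_heads)
    # Within each frequency bin: swap so F is the batch axis, T the sequence.
    xt = np.swapaxes(x, 0, 1)                        # (F, T, D)
    xt = xt + mhsa(xt, *params_bin, num_heads)
    return np.swapaxes(xt, 0, 1)                     # back to (T, F, D)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, F, D, H = 5, 6, 8, 2                          # toy sizes, chosen arbitrarily
    x = rng.standard_normal((T, F, D))
    y = alternating_block(x, init_params(rng, D), init_params(rng, D), H)
    print(y.shape)
```

Each pass only mixes information along one axis of the spectrogram, so a single TF bin sees the whole time-frequency plane only after both passes have run. Stacking several such blocks repeats this exchange.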