This work proposes a learnable filterbank based on a multi-channel masking framework for multi-channel source separation. The learnable filterbank is a 1D convolutional (Conv) layer that transforms the raw waveform into a 2D representation. In contrast to the conventional single-channel masking method, we estimate a mask for each individual microphone channel. The estimated masks are then applied to the transformed waveform representations in a manner analogous to the traditional filter-and-sum beamforming operation: each mask multiplies the corresponding channel's 2D representation, and the masked outputs of all channels are then summed. Finally, a 1D transposed Conv layer converts the summed masked signal back to the waveform domain. The experimental results show that our method outperforms single-channel masking with a learnable filterbank, and can also outperform multi-channel complex masking on the STFT complex spectrum in the STGCSEN model when the learnable filterbank is transformed to a higher feature dimension. A spatial response analysis further verifies that multi-channel masking in the learnable filterbank domain exhibits spatial selectivity.
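To make the described pipeline concrete, the following is a minimal PyTorch sketch of the encode, per-channel mask, sum, and decode steps. The layer sizes, the placeholder mask estimator, and all names are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn


class MultiChannelMaskingSeparator(nn.Module):
    """Sketch of multi-channel masking with a learnable filterbank (assumed sizes)."""

    def __init__(self, n_channels=4, n_filters=256, kernel_size=16, stride=8):
        super().__init__()
        # Learnable filterbank: a 1D Conv shared across microphone channels,
        # mapping the raw waveform to a 2D (feature x frame) representation.
        self.encoder = nn.Conv1d(1, n_filters, kernel_size, stride=stride, bias=False)
        # Placeholder mask estimator: predicts one mask per microphone channel
        # from the concatenated multi-channel encoded features (assumption).
        self.mask_net = nn.Sequential(
            nn.Conv1d(n_channels * n_filters, n_channels * n_filters, kernel_size=1),
            nn.Sigmoid(),
        )
        # Decoder: a 1D transposed Conv maps the summed masked representation
        # back to the waveform domain.
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel_size, stride=stride, bias=False)
        self.n_channels = n_channels
        self.n_filters = n_filters

    def forward(self, x):
        # x: (batch, n_channels, time) raw multi-channel waveform
        B, C, T = x.shape
        feats = self.encoder(x.reshape(B * C, 1, T))        # (B*C, F, frames)
        feats = feats.reshape(B, C, self.n_filters, -1)     # (B, C, F, frames)
        masks = self.mask_net(feats.reshape(B, C * self.n_filters, -1))
        masks = masks.reshape(B, C, self.n_filters, -1)     # one mask per channel
        # Filter-and-sum analogue: mask each channel's representation, sum over channels.
        summed = (feats * masks).sum(dim=1)                 # (B, F, frames)
        return self.decoder(summed)                         # (B, 1, time)


# Usage example: a 4-channel, 1-second mixture at an assumed 16 kHz sampling rate.
# model = MultiChannelMaskingSeparator()
# estimate = model(torch.randn(2, 4, 16000))
```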