We propose an end-to-end trainable approach to single-channel speech separation with an unknown number of speakers. Our approach extends the MulCat source separation backbone with additional output heads: a count-head that infers the number of speakers, and decoder-heads that reconstruct the original signals. Beyond the model, we also propose a metric for evaluating source separation with a variable number of speakers. Specifically, we clarify how to evaluate separation quality when the ground truth has more or fewer speakers than the model predicts. We evaluate our approach on the WSJ0-mix datasets, with mixtures of up to five speakers. We demonstrate that our approach outperforms the state of the art in counting the number of speakers and remains competitive in the quality of the reconstructed signals.
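To make the role of the two kinds of output heads concrete, the following is a minimal sketch of how a count-head could route features to a decoder-head at inference time. It is an illustration under stated assumptions, not the implementation evaluated here: it assumes one decoder-head per candidate speaker count, a count-head that emits one logit per candidate count, and a smallest candidate count of two. The names `select_speaker_count`, `make_toy_decoder`, and `separate` are hypothetical, and the toy decoder merely stands in for a learned one.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical type aliases for illustration only.
Features = List[float]
Signals = List[List[float]]


def select_speaker_count(count_logits: Sequence[float], min_speakers: int = 2) -> int:
    """Return the speaker count with the largest count-head logit.

    Assumption: logit i scores the hypothesis "min_speakers + i speakers".
    """
    best_index = max(range(len(count_logits)), key=lambda i: count_logits[i])
    return min_speakers + best_index


def make_toy_decoder(num_speakers: int) -> Callable[[Features], Signals]:
    """Build a stand-in decoder-head that emits num_speakers output signals."""
    def decode(features: Features) -> Signals:
        # A learned decoder would reconstruct waveforms; here each output
        # is just a scaled copy of the input features.
        return [[x / num_speakers for x in features] for _ in range(num_speakers)]
    return decode


def separate(
    features: Features,
    count_logits: Sequence[float],
    decoders: Dict[int, Callable[[Features], Signals]],
) -> Signals:
    """Pick the decoder-head that matches the predicted speaker count."""
    count = select_speaker_count(count_logits)
    return decoders[count](features)


# One decoder-head per candidate count, from two to five speakers (assumed).
decoders = {k: make_toy_decoder(k) for k in range(2, 6)}
features = [0.2, -0.4, 0.6]           # placeholder backbone output
count_logits = [0.1, 2.3, 0.5, -1.0]  # placeholder count-head output
estimates = separate(features, count_logits, decoders)
print(len(estimates))
```

In this sketch the number of estimated signals is fixed by the count-head's prediction, so a counting error directly yields more or fewer outputs than the ground truth, which is the situation the proposed metric is meant to handle.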