Target speech separation is the task of extracting a certain speaker's voice from speech mixtures according to additional speaker identity information. Recent works have achieved considerable improvements by processing signals directly in the time domain, and the majority of them are trained on fully overlapped speech mixtures. However, since most real-life conversations occur randomly and are sparsely overlapped, we argue that training with data of different overlap ratios is beneficial. An unavoidable problem in doing so is that the widely used SI-SNR loss is undefined for silent sources. This paper proposes a weighted SI-SNR loss, together with joint learning of target speech separation and personal VAD. The weighted SI-SNR loss imposes a weight factor that is proportional to the target speaker's duration and returns zero when the target speaker is absent. Meanwhile, the personal VAD generates masks that set non-target speech to silence. Experiments show that our proposed method outperforms the baseline by 1.73 dB in terms of SDR on fully overlapped speech, and by 4.17 dB and 0.9 dB on sparsely overlapped speech under clean and noisy conditions, respectively. In addition, with only a slight degradation in performance, our model can reduce the time cost of inference.
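To make the weighting idea concrete, the following is a minimal, hypothetical sketch in Python/NumPy, not the paper's exact formulation: it computes the standard SI-SNR and scales the negative SI-SNR loss by an assumed `target_active_ratio` (the fraction of frames in which the target speaker is active), returning zero when the target speaker is absent.

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Standard scale-invariant SNR in dB (signals are zero-meaned first)."""
    est = est - est.mean()
    ref = ref - ref.mean()
    s_target = np.dot(est, ref) / (np.dot(ref, ref) + eps) * ref
    e_noise = est - s_target
    return 10.0 * np.log10((np.dot(s_target, s_target) + eps) /
                           (np.dot(e_noise, e_noise) + eps))

def weighted_si_snr_loss(est, ref, target_active_ratio, eps=1e-8):
    """Hypothetical weighted SI-SNR loss: the weight is proportional to the
    target speaker's active duration, and the loss is zero for a silent
    target source (where SI-SNR itself is undefined)."""
    if target_active_ratio == 0:
        return 0.0
    return -target_active_ratio * si_snr(est, ref, eps)

# Illustration: fully overlapped target (weight = 1) vs. absent target (weight = 0).
t = np.linspace(0.0, 1.0, 16000)
ref = np.sin(2 * np.pi * 440 * t)            # hypothetical target reference
est = ref + 0.1 * np.random.randn(16000)     # hypothetical separator output
print(weighted_si_snr_loss(est, ref, target_active_ratio=1.0))
print(weighted_si_snr_loss(est, np.zeros(16000), target_active_ratio=0.0))
```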