Robust voice activity detection (VAD) is a challenging task in environments with a low signal-to-noise ratio (SNR). Recent studies show that speech enhancement helps VAD, but the performance improvement is limited. To address this issue, we propose a speech-enhancement-aided end-to-end multi-task model for VAD. The model has two decoders, one for speech enhancement and the other for VAD. The two decoders share the same encoder and speech separation network. Unlike the straightforward approach of using two separate objectives for VAD and speech enhancement, we propose a new joint optimization objective, the VAD-masked scale-invariant source-to-distortion ratio (mSI-SDR). mSI-SDR uses VAD information to mask the output of the speech enhancement decoder during training. As a result, the VAD and speech enhancement tasks are jointly optimized not only through the shared encoder and separation network, but also at the objective level. In theory, the model also satisfies the requirements of real-time operation. Experimental results show that the multi-task method significantly outperforms its single-task VAD counterpart. Moreover, mSI-SDR outperforms SI-SDR in the same multi-task setting.
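To make the idea of masking the enhancement output with VAD information concrete, the following is a minimal sketch and not the formulation used in the paper, which the abstract does not spell out. `si_sdr` follows the standard SI-SDR definition. `vad_masked_si_sdr` is a hypothetical illustration that assumes the mask is applied by expanding frame-level VAD scores to sample level and multiplying them into the enhanced waveform before scoring. The function names, the `frame_len` parameter, and the masking scheme are all assumptions.

```python
import math


def si_sdr(estimate, target, eps=1e-8):
    """Standard scale-invariant SDR (in dB) of `estimate` against `target`.

    Both arguments are equal-length sequences of float samples.
    """
    dot = sum(e * t for e, t in zip(estimate, target))
    target_energy = sum(t * t for t in target)
    # Optimal scaling of the target, which makes the measure scale-invariant.
    alpha = dot / (target_energy + eps)
    scaled = [alpha * t for t in target]
    signal = sum(s * s for s in scaled)
    noise = sum((s - e) ** 2 for s, e in zip(scaled, estimate))
    return 10.0 * math.log10((signal + eps) / (noise + eps))


def vad_masked_si_sdr(estimate, target, vad_scores, frame_len):
    """Hypothetical VAD-masked variant (an assumption, not the paper's formula).

    `vad_scores` holds one speech score in [0, 1] per frame of `frame_len`
    samples. Each score is repeated across its frame, multiplied into the
    enhanced waveform, and the masked waveform is then scored with SI-SDR.
    """
    last = len(vad_scores) - 1
    mask = [vad_scores[min(i // frame_len, last)] for i in range(len(estimate))]
    masked = [e * m for e, m in zip(estimate, mask)]
    return si_sdr(masked, target)


# Toy signals: the first frame is silence, the second frame is speech.
target = [0.0, 0.0, 0.0, 0.0, 1.0, -1.0, 1.0, -1.0]
# Enhanced output with residual noise left in the silent frame.
estimate = [0.5, -0.5, 0.5, -0.5, 1.0, -1.0, 1.0, -1.0]
vad_scores = [0.0, 1.0]

plain = si_sdr(estimate, target)
masked = vad_masked_si_sdr(estimate, target, vad_scores, frame_len=4)
print(f"SI-SDR: {plain:.2f} dB, VAD-masked SI-SDR: {masked:.2f} dB")
```

In this toy case a correct VAD output suppresses the residual noise in the silent frame, so the masked score is higher than the plain one. In training, the negative of such a score would serve as the loss, and in a differentiable framework the gradient would then reach both decoders, which is one way the two tasks could be coupled at the objective level.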