Single-channel speech separation using Soft-minimum Permutation Invariant Training

The goal of speech separation is to extract multiple speech sources from a single-microphone recording. With recent advances in deep learning and the availability of large datasets, speech separation has been formulated as a supervised learning problem. These approaches use a supervised learning algorithm, typically a deep neural network, to learn discriminative patterns of speech, speakers, and background noise. A long-standing problem in supervised speech separation is finding the correct label for each separated speech signal, known as label permutation ambiguity: the output-label assignment between the separated sources and the available single-speaker speech labels must be determined. The best output-label assignment is needed to compute the separation error, which is then used to update the model parameters. Permutation Invariant Training (PIT) has recently been shown to be a promising solution to the label ambiguity problem. However, PIT's overconfident choice of output-label assignment results in a sub-optimally trained model. In this work, we propose a probabilistic optimization framework to address the inefficiency of PIT in finding the best output-label assignment. The proposed method, trainable Soft-minimum PIT, is applied to the same Long Short-Term Memory (LSTM) architecture used by the PIT speech separation method. Our experiments show that the proposed method significantly outperforms conventional PIT speech separation ($p < 0.01$), improving the Signal-to-Distortion Ratio (SDR) by 1 dB and the Signal-to-Interference Ratio (SIR) by 1.5 dB.
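The abstract does not spell out the Soft-minimum PIT objective, so the following is only a minimal Python sketch under our own assumptions. It contrasts PIT's hard minimum over output-label assignments with one possible soft minimum: the expected error under softmin weights proportional to exp(-L/T) over all assignments. The mean-squared-error measure, the function names, and the `temperature` parameter are hypothetical illustrations, not the authors' formulation or training code.

```python
import itertools
import math


def mse(estimate, reference):
    """Mean squared error between two equal-length signals (illustrative error measure)."""
    return sum((e - r) ** 2 for e, r in zip(estimate, reference)) / len(estimate)


def assignment_losses(estimates, references):
    """Return every output-label assignment (permutation) and its average error."""
    n = len(estimates)
    perms = list(itertools.permutations(range(n)))
    losses = [
        # estimate i is paired with reference perm[i]
        sum(mse(estimates[i], references[perm[i]]) for i in range(n)) / n
        for perm in perms
    ]
    return perms, losses


def pit_loss(estimates, references):
    """Conventional PIT: keep only the assignment with the lowest error (hard minimum)."""
    perms, losses = assignment_losses(estimates, references)
    best = min(range(len(losses)), key=losses.__getitem__)
    return losses[best], perms[best]


def soft_min_pit_loss(estimates, references, temperature=1.0):
    """Hypothetical soft-minimum PIT: expected error under softmin weights over assignments."""
    _, losses = assignment_losses(estimates, references)
    lowest = min(losses)  # shift for numerical stability; normalized weights are unchanged
    scores = [math.exp(-(loss - lowest) / temperature) for loss in losses]
    total = sum(scores)
    weights = [s / total for s in scores]
    return sum(w * loss for w, loss in zip(weights, losses)), weights


# Two sources whose estimates come out in swapped order.
estimates = [[1.0, 0.0], [0.0, 1.0]]
references = [[0.0, 1.0], [1.0, 0.0]]
print(pit_loss(estimates, references))           # → (0.0, (1, 0))
print(soft_min_pit_loss(estimates, references))  # error weighted over both assignments
```

In this sketch, as `temperature` approaches zero the weights concentrate on the best assignment and the loss reduces to the PIT loss; as it grows, the loss approaches the mean over all assignments. Because every assignment contributes in proportion to its weight, the training signal depends less on a single hard choice, which illustrates in a simplified way the concern about PIT's overconfident assignment raised in the abstract.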

翻译：语音分离的目标是从单一的麦克风录音中提取多种语音源。最近,随着深度学习的进步和大型数据集的可用性,语音分离被设计成受监督的学习问题。这些方法的目的是利用监督的学习算法(通常是深神经网络)来学习语言、演讲人和背景噪音的歧视性模式。监督的语音分离的一个长期问题就是为每个分离的语音信号找到正确的标签,称为标签变换模糊性。调换模糊性指的是确定分离的源和现有单声调语音标签之间输出标签分配的问题。找到计算分离错误所需的最佳输出标签分配, 后用于更新模型的参数。最近, 调换性不动性培训(PIT) 显示处理标签模糊性问题的可行办法。然而, PIT 输出标签分配的过于自信选择导致一个亚最佳的训练模式。在这项工作中,我们提议一个精确的优化框架, 解决 PIT 的成本效益, 以找到最佳输出- 标值 Palbl 格式, 用于长期输出- IM 常规结构中的拟议方法, 显示长期输出- IMS- IM IM 格式格式。我们的拟议方法, 显示在长期打印- IMLA- 格式格式中, IM- IM- ta 格式格式格式格式格式- 显示使用格式格式- 格式格式格式格式格式格式格式格式格式格式格式格式格式格式格式,,,, 格式格式格式, 显示, 格式格式格式格式格式,,, 格式格式, 格式格式,, 格式格式,, 显示在格式格式,, 格式格式格式格式格式格式格式格式,,,,,,, 格式格式格式格式格式格式格式格式格式格式格式格式格式格式格式格式格式格式格式格式格式,,,, 格式-, 格式格式格式格式格式化格式格式格式格式格式格式格式格式格式