Heterogeneous Separation Consistency Training for Adaptation of Unsupervised Speech Separation

Recently, supervised speech separation has made great progress. However, limited by the nature of supervised training, most existing separation methods require ground-truth sources and are trained on synthetic datasets. This reliance on ground truth is problematic, because ground-truth signals are usually unavailable in real conditions. Moreover, in many industry scenarios, the real acoustic characteristics deviate far from those of simulated datasets. As a result, performance usually degrades significantly when supervised speech separation models are applied to real applications. To address these problems, we propose a novel separation consistency training, termed SCT, which exploits real-world unlabeled mixtures to improve cross-domain unsupervised speech separation in an iterative manner by leveraging the complementary information obtained from heterogeneous (structurally distinct but behaviorally complementary) models. SCT follows a framework that uses two heterogeneous neural networks (HNNs) to produce high-confidence pseudo labels for unlabeled real speech mixtures. These labels are then updated and used to refine the HNNs so that they produce more reliable and consistent separation results for pseudo-labeling real mixtures. To make maximal use of the complementary information between the different separation networks, a cross-knowledge adaptation is further proposed. Together with the simulated dataset, the real mixtures with high-confidence pseudo labels are then used to update the HNN separation models iteratively. In addition, we find that combining the heterogeneous separation outputs by a simple linear fusion can slightly improve the final system performance further.
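The abstract does not specify how the confidence of a pseudo label is measured, so the following is only a minimal sketch of one plausible reading: the outputs of the two heterogeneous networks are compared under the best source permutation, a mixture is kept when the two agree closely, and the aligned outputs are combined by linear fusion. The use of SI-SNR as the agreement measure, the `threshold_db` and `alpha` values, and all function names are assumptions made for illustration and are not taken from the paper.

```python
import itertools

import numpy as np


def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB between two 1-D signals."""
    est = est - est.mean()
    ref = ref - ref.mean()
    # Project the estimate onto the reference to isolate the target component.
    proj = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    noise = est - proj
    return 10.0 * np.log10((np.dot(proj, proj) + eps) / (np.dot(noise, noise) + eps))


def consistency(out_a, out_b):
    """Best-permutation mean SI-SNR between two sets of separated sources.

    out_a, out_b: arrays of shape (n_src, n_samples), one per network.
    Returns (score_db, perm), where perm[i] is the row of out_b matched to out_a[i].
    """
    n_src = out_a.shape[0]
    best_score, best_perm = -np.inf, None
    for perm in itertools.permutations(range(n_src)):
        score = np.mean([si_snr(out_b[j], out_a[i]) for i, j in enumerate(perm)])
        if score > best_score:
            best_score, best_perm = score, perm
    return best_score, best_perm


def select_and_fuse(out_a, out_b, threshold_db=10.0, alpha=0.5):
    """Decide whether a mixture is a confident pseudo-label candidate and fuse outputs.

    threshold_db and alpha are hypothetical values chosen for this sketch.
    Returns (accepted, fused, score_db).
    """
    score, perm = consistency(out_a, out_b)
    aligned_b = out_b[list(perm)]  # reorder network B's sources to match network A
    fused = alpha * out_a + (1.0 - alpha) * aligned_b  # simple linear fusion
    return bool(score >= threshold_db), fused, score
```

In an iterative scheme of the kind the abstract describes, mixtures accepted by a check like `select_and_fuse` would be added to the training set with their pseudo labels, the two networks would be updated, and the selection would be repeated with the refined models.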
