We propose RemixIT, a simple and novel self-supervised training method for speech enhancement. The proposed method is based on a continuous self-training scheme that overcomes limitations of previous studies, including assumptions about the in-domain noise distribution and the need for access to clean target signals. Specifically, a separation teacher model is pre-trained on an out-of-domain dataset and is used to infer estimated target signals for a batch of in-domain mixtures. Next, we bootstrap the mixing process by generating artificial mixtures from permuted estimated clean and noise signals. Finally, the student model is trained using the permuted estimated sources as targets, while we periodically update the teacher's weights using the latest student model. Our experiments show that RemixIT outperforms several previous state-of-the-art self-supervised methods under multiple speech enhancement tasks. Additionally, RemixIT provides a seamless alternative for semi-supervised and unsupervised domain adaptation for speech enhancement tasks, while being general enough to be applied to any separation task and paired with any separation model.
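The training loop described above can be sketched in a few lines. The following is a minimal NumPy illustration of one RemixIT step, not the authors' implementation: `teacher` and `student` are toy stand-ins for real separation models, and the loss is a plain mean-squared error chosen here for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the pre-trained teacher and the student separation
# models (hypothetical; in practice these would be neural networks).
def teacher(mix):
    # Pretend the teacher splits each mixture into "speech" and "noise".
    return 0.6 * mix, 0.4 * mix

def student(mix):
    return 0.5 * mix, 0.5 * mix

def remixit_step(mixtures):
    # 1) Teacher infers estimated sources for a batch of in-domain mixtures.
    est_speech, est_noise = teacher(mixtures)       # each of shape (B, T)
    # 2) Bootstrap the mixing process: permute the estimated noises across
    #    the batch and remix them with the estimated speech signals.
    perm = rng.permutation(len(mixtures))
    remixed = est_speech + est_noise[perm]
    # 3) Train the student to recover the permuted estimated sources.
    pred_speech, pred_noise = student(remixed)
    loss = np.mean((pred_speech - est_speech) ** 2) \
         + np.mean((pred_noise - est_noise[perm]) ** 2)
    # (Periodically, the teacher's weights would be updated from the
    #  latest student model; omitted in this stateless sketch.)
    return loss

mixtures = rng.standard_normal((8, 16000))  # batch of 8 one-second clips
loss = remixit_step(mixtures)
```

Because the targets are themselves teacher estimates, no clean in-domain signals are required at any point, which is what enables the unsupervised domain-adaptation use case.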