Although deep learning (DL) has achieved notable progress in speech enhancement (SE), further research is still required for DL-based SE systems to adapt effectively and efficiently to particular speakers. In this study, we propose a novel meta-learning-based speaker-adaptive SE approach (called OSSEM) that aims to achieve SE model adaptation in a one-shot manner. OSSEM consists of a modified transformer SE network and a speaker-specific masking (SSM) network. In practice, the SSM network takes a speaker embedding, extracted from an enrollment utterance using ECAPA-TDNN, and adjusts the input noisy feature through masking. To evaluate OSSEM, we designed a modified Voice Bank-DEMAND dataset, in which one utterance from the test set is used for model adaptation and the remaining utterances are used for performance evaluation. Moreover, we required the enhancement process to run in real time and therefore designed OSSEM as a causal SE system. Experimental results first show that OSSEM can effectively adapt a pretrained SE model to a particular speaker using only one utterance, thereby yielding improved SE results. Meanwhile, OSSEM exhibits performance competitive with state-of-the-art causal SE systems.
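The core idea of the SSM network described above can be illustrated with a minimal sketch: a speaker embedding is projected to a per-dimension mask in (0, 1), which is applied element-wise to the noisy input feature before enhancement. This is an assumption-laden toy version, not the paper's actual network; the projection weights, feature dimension (257-bin spectrogram), and the 192-dimensional ECAPA-TDNN embedding size are all illustrative placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def speaker_specific_mask(noisy_feat, spk_emb, W, b):
    """Hypothetical SSM-style masking: project an enrolled-speaker
    embedding to a per-feature-dimension mask in (0, 1), then scale
    the noisy feature element-wise (broadcast over time frames)."""
    mask = sigmoid(W @ spk_emb + b)   # shape: (feat_dim,)
    return noisy_feat * mask          # shape: (frames, feat_dim)

# Toy dimensions (assumptions, not from the paper):
# 257 spectral bins, 192-dim embedding (common ECAPA-TDNN size), 100 frames.
rng = np.random.default_rng(0)
frames, feat_dim, emb_dim = 100, 257, 192
noisy = rng.standard_normal((frames, feat_dim))
emb = rng.standard_normal(emb_dim)
W = 0.01 * rng.standard_normal((feat_dim, emb_dim))
b = np.zeros(feat_dim)

masked = speaker_specific_mask(noisy, emb, W, b)
print(masked.shape)  # (100, 257)
```

In the full system, the projection would be a learned network and the masked feature would feed the causal transformer SE model; this sketch only shows how a single enrollment embedding can modulate the input representation.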