Existing deep-learning-based speech enhancement (SE) methods either use blind end-to-end training or explicitly incorporate speaker embeddings or phonetic information into the SE network to improve speech quality. In this paper, we treat speech and noise as different types of sound events and propose an event-based query method for SE. Specifically, representative speech embeddings that discriminate speech from noise are first pre-trained on the sound event detection (SED) task. The embeddings are then clustered into a fixed set of golden speech queries that help the SE network extract the speech from noisy audio. The golden speech queries can be obtained offline and generalize to different SE datasets and networks. As a result, the method adds little extra complexity and requires no per-speaker enrollment. Experimental results show that the proposed method yields significant gains over the baselines and that the golden queries generalize well to different datasets.
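The query-construction step can be illustrated with a minimal sketch. The abstract does not state which clustering algorithm is used or how the queries are injected into the SE network, so everything below is an assumption for illustration: plain k-means stands in for the clustering, scaled dot-product attention stands in for the conditioning, the function names `cluster_queries` and `attend_to_queries` are hypothetical, and random arrays replace real SED embeddings and noisy-audio features.

```python
import numpy as np


def cluster_queries(embeddings, num_queries, num_iters=20, seed=0):
    """Cluster (N, D) embeddings into (num_queries, D) centroids with plain k-means.

    Assumption: k-means is used here only as a stand-in clustering method.
    """
    rng = np.random.default_rng(seed)
    # Initialise centroids from distinct, randomly chosen embeddings.
    idx = rng.choice(len(embeddings), size=num_queries, replace=False)
    centroids = embeddings[idx].copy()
    for _ in range(num_iters):
        # Squared Euclidean distance from every embedding to every centroid.
        diffs = embeddings[:, None, :] - centroids[None, :, :]
        labels = (diffs ** 2).sum(axis=-1).argmin(axis=1)
        for k in range(num_queries):
            members = embeddings[labels == k]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[k] = members.mean(axis=0)
    return centroids


def attend_to_queries(frames, queries):
    """Scaled dot-product attention of (T, D) frames over (K, D) fixed queries.

    Assumption: attention is one plausible way to let a network use the queries.
    """
    scores = frames @ queries.T / np.sqrt(queries.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ queries


# Placeholder data: random vectors in place of pre-trained SED speech embeddings.
rng = np.random.default_rng(0)
speech_embeddings = rng.normal(size=(500, 16))
golden_queries = cluster_queries(speech_embeddings, num_queries=4)

# Placeholder noisy-audio features; each frame receives a query-based context vector.
noisy_frames = rng.normal(size=(10, 16))
context = attend_to_queries(noisy_frames, golden_queries)
```

Because the centroids are computed once and then held fixed, the same `golden_queries` array could in principle be reused across datasets and networks without enrolling individual speakers, which is the property the abstract emphasizes.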