Multimodal learning has proven effective at improving speech enhancement (SE) performance, especially in challenging conditions such as low signal-to-noise ratios, speech noise, or unseen noise types. Previous studies have used several types of auxiliary data to build multimodal SE systems, including lip images, electropalatography, and electromagnetic midsagittal articulography. In this paper, we propose EMGSE, a novel framework for multimodal SE that integrates audio and facial electromyography (EMG) signals. Facial EMG is a biological signal that carries articulatory movement information and can be measured non-invasively. Experimental results show that the proposed EMGSE system outperforms the audio-only SE system, and that the benefits of fusing EMG signals with acoustic signals are notable under challenging conditions. Furthermore, this study reveals that cheek EMG is sufficient for SE.