Numerous studies have investigated the effectiveness of audio-visual multimodal learning for speech enhancement (AVSE) tasks, seeking solutions that use visual data as auxiliary and complementary input to reduce noise in noisy speech signals. Recently, we proposed a lite audio-visual speech enhancement (LAVSE) algorithm. Compared to conventional AVSE systems, LAVSE requires less online computation and moderately mitigates the user privacy concerns associated with facial data. In this study, we extend LAVSE to improve its ability to address three practical issues often encountered when implementing AVSE systems, namely, the requirement for additional visual data, audio-visual asynchronization, and low-quality visual data. The proposed system, termed improved LAVSE (iLAVSE), uses a convolutional recurrent neural network architecture as its core AVSE model. We evaluate iLAVSE on the Taiwan Mandarin speech with video dataset. Experimental results confirm that, compared to conventional AVSE systems, iLAVSE effectively overcomes the aforementioned three practical issues and improves enhancement performance. The results also confirm that iLAVSE is suitable for real-world scenarios, where high-quality audio-visual sensors may not always be available.
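To make the described architecture concrete, the following is a minimal sketch of a CRNN-style audio-visual enhancement model in PyTorch. All layer sizes, the class name `CRNNAVSE`, and the frame-level concatenation fusion scheme are illustrative assumptions, not the exact iLAVSE architecture from the paper.

```python
# A minimal sketch of a convolutional recurrent network for audio-visual
# speech enhancement. Hypothetical dimensions and fusion scheme; not the
# authors' exact model.
import torch
import torch.nn as nn

class CRNNAVSE(nn.Module):
    def __init__(self, n_freq=257, vis_dim=64, hidden=256):
        super().__init__()
        # 1-D convolutions over time extract local spectro-temporal
        # features from the noisy magnitude spectrogram.
        self.audio_conv = nn.Sequential(
            nn.Conv1d(n_freq, 256, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        # Visual stream: per-frame visual embeddings (e.g., compressed
        # lip-region features) projected to a common dimension.
        self.vis_proj = nn.Linear(vis_dim, 64)
        # Recurrent layers model longer-range temporal context over the
        # fused audio-visual features.
        self.rnn = nn.LSTM(256 + 64, hidden, num_layers=2, batch_first=True)
        # Output layer predicts the enhanced magnitude spectrogram frame.
        self.out = nn.Linear(hidden, n_freq)

    def forward(self, noisy_mag, vis_emb):
        # noisy_mag: (batch, time, n_freq); vis_emb: (batch, time, vis_dim),
        # assumed already synchronized to the audio frame rate.
        a = self.audio_conv(noisy_mag.transpose(1, 2)).transpose(1, 2)
        v = torch.relu(self.vis_proj(vis_emb))
        fused = torch.cat([a, v], dim=-1)  # frame-level late fusion
        h, _ = self.rnn(fused)
        return torch.relu(self.out(h))     # enhanced magnitude

# Toy forward pass: one utterance of 100 frames.
model = CRNNAVSE()
enhanced = model(torch.rand(1, 100, 257), torch.rand(1, 100, 64))
print(enhanced.shape)  # torch.Size([1, 100, 257])
```

A design like this lets the visual stream be dropped or zeroed out at inference when visual data is missing or low quality, which is one way such a system can degrade gracefully in the practical scenarios the abstract describes.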