In this paper, we present CITISEN, a deep learning-based speech signal-processing mobile application that performs three functions: speech enhancement (SE), acoustic scene conversion (ASC), and model adaptation (MA). For SE, CITISEN reduces the noise components in speech signals, thereby improving their clarity and intelligibility. For ASC, CITISEN converts the current background sound to a different background sound. For MA, CITISEN adapts an SE model using only a few audio files when it encounters unknown speakers or noise types; the adapted SE model is then used to enhance subsequent noisy utterances. Objective evaluations and subjective listening tests confirmed the effectiveness of CITISEN in performing all three functions. These promising results suggest that CITISEN can potentially serve as a front-end processor for various speech-related services, such as voice communication, assistive hearing devices, and virtual reality headsets.