CITISEN: A Deep Learning-Based Speech Signal-Processing Mobile Application

This study presents CITISEN, a deep learning-based speech signal-processing mobile application. CITISEN provides three functions: speech enhancement (SE), model adaptation (MA), and background noise conversion (BNC). Together, these let CITISEN serve as a platform for using and evaluating SE models and for flexibly extending those models to new noise environments and users. For SE, a pretrained SE model downloaded from the cloud server reduces noise components in live or saved recordings provided by users. When unseen noise or speaker environments are encountered, the MA function is applied: a few audio samples recorded in the noisy environment are uploaded and used to adapt the pretrained SE model on the server. Finally, for BNC, CITISEN first removes the background noise with an SE model and then mixes the processed speech with new background noise. The novel BNC function can be used to evaluate SE performance under specific conditions, conceal a user's actual surroundings, and provide entertainment. Experimental results confirmed the effectiveness of the SE, MA, and BNC functions. Compared with the noisy speech signals, the enhanced speech signals achieved improvements of about 6% and 33% in short-time objective intelligibility (STOI) and perceptual evaluation of speech quality (PESQ), respectively. With MA, STOI and PESQ improved further by approximately 6% and 11%, respectively. Finally, the BNC experiments indicated that speech signals converted from noisy and silent backgrounds yield similar scene identification accuracy and similar embeddings in an acoustic scene classification model. Therefore, the proposed BNC can effectively convert the background noise of a speech signal and can serve as a data augmentation method when clean speech signals are unavailable.
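The two-stage BNC idea described above (enhance first, then mix in a new background) can be illustrated with a minimal sketch. This is not the CITISEN implementation: the abstract does not say how the new noise is scaled, so mixing at a target signal-to-noise ratio is an assumption here, and the names `mix_at_snr`, `convert_background`, and the `enhance` callable are hypothetical stand-ins for the app's SE model and mixing step.

```python
import math
from itertools import cycle, islice


def mean_power(samples):
    """Average power (mean of squared samples) of a signal."""
    return sum(v * v for v in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the speech-to-noise ratio is `snr_db`.

    Assumption: SNR-based scaling is our choice for illustration only.
    """
    # Loop or trim the noise so it covers the whole utterance.
    noise = list(islice(cycle(noise), len(speech)))
    speech_power = mean_power(speech)
    noise_power = mean_power(noise)
    if speech_power == 0 or noise_power == 0:
        raise ValueError("speech and noise must both be non-silent")
    # Gain that brings the noise to the power implied by the target SNR.
    gain = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def convert_background(noisy, enhance, new_noise, snr_db=10.0):
    """BNC sketch: remove the old background, then mix in the new one.

    `enhance` is a hypothetical callable standing in for an SE model.
    """
    enhanced = enhance(noisy)
    return mix_at_snr(enhanced, new_noise, snr_db)


if __name__ == "__main__":
    # Synthetic stand-ins: a tone for "speech", short square wave for "noise".
    tone = [0.5 * math.sin(2 * math.pi * 220 * i / 16000) for i in range(16000)]
    new_noise = [0.3 if (i // 40) % 2 else -0.3 for i in range(4000)]
    # Placeholder enhancer; a real system would run the SE model here.
    converted = convert_background(tone, lambda x: list(x), new_noise, snr_db=5.0)
    print(len(converted))
```

Keeping the enhancer as an injected callable means the same mixing step could be reused with different SE models, which matches the abstract's framing of CITISEN as a platform for evaluating them.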
