In this study, we propose MOSA-Net, a cross-domain multi-objective speech assessment model that estimates multiple speech assessment metrics simultaneously. Specifically, MOSA-Net takes a test speech signal as input and estimates its speech quality, intelligibility, and distortion assessment scores. It comprises a convolutional neural network and bidirectional long short-term memory (CNN-BLSTM) architecture for representation extraction, followed by a multiplicative attention layer and a fully-connected layer for each assessment metric. In addition, cross-domain features (spectral and time-domain features) and latent representations from self-supervised learned models are used as inputs, combining rich acoustic information from different speech representations to obtain more accurate assessments. Experimental results show that MOSA-Net can accurately predict perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI), and speech distortion index (SDI) scores when tested on both noisy and enhanced speech utterances, under both seen test conditions (where the test speakers and noise types are included in the training set) and unseen test conditions (where they are not). Given this confirmed prediction capability, we further adopt the latent representations of MOSA-Net to guide the speech enhancement (SE) process and derive a quality-intelligibility (QI)-aware SE (QIA-SE) approach accordingly. Experimental results show that QIA-SE provides superior enhancement performance compared with the baseline SE system in terms of both objective evaluation metrics and qualitative evaluation tests.
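The per-metric output stage described above (a multiplicative attention layer followed by a fully-connected layer for each assessment metric) can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the shared CNN-BLSTM features are replaced by random numbers, the weights are untrained, and the function names (`multiplicative_attention`, `metric_head`, `predict_all`), the self-attention form of the scores, and the averaging of frame-level scores into one utterance-level score are all assumptions made for illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def multiplicative_attention(h, w):
    """Self-attention with multiplicative scores h @ w @ h^T (assumed form)."""
    h_transposed = [list(col) for col in zip(*h)]
    scores = matmul(matmul(h, w), h_transposed)   # (T, T)
    weights = [softmax(r) for r in scores]        # each row sums to 1
    return matmul(weights, h)                     # (T, D) context vectors


def metric_head(h, w_att, w_fc, bias):
    """One head: attention, a linear layer per frame, then a mean over frames."""
    context = multiplicative_attention(h, w_att)
    frame_scores = [sum(c * w for c, w in zip(frame, w_fc)) + bias
                    for frame in context]
    return sum(frame_scores) / len(frame_scores)  # assumed utterance-level pooling


def predict_all(h, heads):
    """Run every metric head on the same shared features."""
    return {name: metric_head(h, *params) for name, params in heads.items()}


# Hypothetical stand-in for the shared CNN-BLSTM output: T frames, D features.
rng = random.Random(0)
T, D = 5, 4
shared = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(T)]


def random_head():
    """Untrained (random) parameters for one metric head."""
    w_att = [[rng.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(D)]
    w_fc = [rng.uniform(-0.5, 0.5) for _ in range(D)]
    return w_att, w_fc, 0.0


heads = {name: random_head() for name in ("PESQ", "STOI", "SDI")}
scores = predict_all(shared, heads)
print(scores)
```

Each head has its own attention and output weights but reads the same shared features, which is what lets a single forward pass produce all three scores. The printed values are meaningless here because the weights are random.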