In this study, we propose a cross-domain multi-objective speech assessment model called MOSA-Net, which can estimate multiple speech assessment metrics simultaneously. More specifically, MOSA-Net is designed to estimate the speech quality, intelligibility, and distortion assessment scores of an input test speech signal. It comprises a convolutional neural network and bidirectional long short-term memory (CNN-BLSTM) architecture for representation extraction, followed by a multiplicative attention layer and a fully-connected layer for each assessment metric. In addition, cross-domain features (spectral and time-domain features) and latent representations from self-supervised learned models are used as inputs, combining rich acoustic information from different speech representations to obtain more accurate assessments. Experimental results show that MOSA-Net can accurately predict perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI), and speech distortion index (SDI) scores when tested on noisy and enhanced speech utterances under both seen and unseen test conditions. Moreover, MOSA-Net, although originally trained to predict objective scores, can serve as a pre-trained model and be effectively adapted, with a limited amount of training data, into an assessment model for predicting subjective quality and intelligibility scores. In light of this confirmed prediction capability, we further adopt the latent representations of MOSA-Net to guide the speech enhancement (SE) process and derive a quality-intelligibility (QI)-aware SE (QIA-SE) approach accordingly. Experimental results show that QIA-SE provides superior enhancement performance compared with the baseline SE system in terms of both objective evaluation metrics and a qualitative evaluation test.
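The multi-objective design described above (one shared representation, then a multiplicative attention layer and a fully-connected layer per metric) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the CNN-BLSTM encoder is replaced by random stand-in features, the weights are untrained, the attention is assumed to be a bilinear (Luong-style) self-attention, and averaging frame-level scores into an utterance-level score is an assumption. The names `MetricHead` and `assess` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class MetricHead:
    """Hypothetical per-metric head: multiplicative attention + one linear layer."""

    def __init__(self, dim, rng):
        # Bilinear attention weights (assumed form: score = q^T W k).
        self.W_att = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        # Fully-connected layer mapping each attended frame to a scalar score.
        self.w_out = rng.standard_normal(dim) / np.sqrt(dim)
        self.b_out = 0.0

    def __call__(self, feats):
        # feats: (frames, dim) shared features from the encoder.
        scores = feats @ self.W_att @ feats.T        # (frames, frames)
        weights = softmax(scores, axis=-1)           # each row sums to 1
        context = weights @ feats                    # (frames, dim)
        frame_scores = context @ self.w_out + self.b_out  # (frames,)
        # Assumption: utterance-level score is the mean of frame-level scores.
        return float(frame_scores.mean())


def assess(shared_feats, heads):
    """Run every metric head on the same shared representation."""
    return {name: head(shared_feats) for name, head in heads.items()}


rng = np.random.default_rng(0)
frames, dim = 50, 16
# Stand-in for the CNN-BLSTM output; a real system would compute this from audio.
shared = rng.standard_normal((frames, dim))
heads = {name: MetricHead(dim, rng) for name in ("PESQ", "STOI", "SDI")}
predictions = assess(shared, heads)
```

Because all three heads read the same `shared` features, a training loss summed over the metrics would push the encoder toward a representation useful for quality, intelligibility, and distortion at once. The outputs here are unconstrained real numbers; mapping them to the valid range of each metric is left out of the sketch.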