Current deep learning (DL) based approaches to speech intelligibility enhancement in noisy environments are generally trained to minimise the distance between clean and enhanced speech features. These approaches often improve speech quality; however, they generalise poorly and may not deliver the required speech intelligibility in everyday noisy situations. To address these challenges, researchers have explored intelligibility-oriented (I-O) loss functions for training DL approaches to robust speech enhancement (SE). In this paper, we formulate a novel canonical correlation-based I-O loss function to train DL algorithms more effectively. Specifically, we present a fully convolutional SE model that uses a modified canonical correlation-based short-time objective intelligibility (CC-STOI) metric as its training cost function. To the best of our knowledge, this is the first work to integrate canonical correlation into an I-O loss function for SE. Comparative experimental results demonstrate that the proposed CC-STOI based SE framework outperforms DL models trained with conventional STOI and distance-based loss functions on both standard objective and subjective evaluation measures when dealing with unseen speakers and noises.
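As a rough illustration of the contrast drawn above between distance-based and correlation-based training objectives, the following NumPy sketch compares a mean-squared-error loss with a loss built on the canonical correlations between clean and enhanced feature frames. It is not the CC-STOI formulation of this paper: the function names (`mse_loss`, `canonical_correlations`, `cc_loss`), the regulariser `eps`, and the choice of one minus the mean canonical correlation as the loss are all assumptions made for illustration, and the STOI-specific steps (one-third octave band analysis, short-time segmentation, clipping) are omitted.

```python
import numpy as np


def mse_loss(clean, enhanced):
    """Distance-based loss: mean squared error between feature matrices."""
    return float(np.mean((clean - enhanced) ** 2))


def _inv_sqrt(cov):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    eigvals, eigvecs = np.linalg.eigh(cov)
    return eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T


def canonical_correlations(x, y, eps=1e-6):
    """Canonical correlations between x (frames, p) and y (frames, q).

    `eps` is a hypothetical ridge term added to keep the covariance
    matrices invertible; it is not taken from the paper.
    """
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    n = x.shape[0]
    cxx = x.T @ x / (n - 1) + eps * np.eye(x.shape[1])
    cyy = y.T @ y / (n - 1) + eps * np.eye(y.shape[1])
    cxy = x.T @ y / (n - 1)
    # Singular values of the whitened cross-covariance are the
    # canonical correlations.
    whitened = _inv_sqrt(cxx) @ cxy @ _inv_sqrt(cyy)
    corrs = np.linalg.svd(whitened, compute_uv=False)
    return np.clip(corrs, 0.0, 1.0)


def cc_loss(clean, enhanced):
    """Illustrative correlation-based loss: 1 - mean canonical correlation."""
    return float(1.0 - canonical_correlations(clean, enhanced).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.standard_normal((200, 4))   # stand-in for clean features
    rescaled = 2.0 * clean + 3.0            # affine distortion of the same content
    print("MSE loss:", mse_loss(clean, rescaled))
    print("CC loss: ", cc_loss(clean, rescaled))
```

When the enhanced features are an affine rescaling of the clean features, `mse_loss` is large while `cc_loss` stays close to zero. This invariance is one reason correlation-based objectives are considered as alternatives to plain distance measures.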