This paper describes our submission to the Second Clarity Enhancement Challenge (CEC2), which consists of target speech enhancement for hearing-aid (HA) devices in noisy-reverberant environments with multiple interferers such as music and competing speakers. Our approach builds upon the iterative neural/beamforming enhancement (iNeuBe) framework introduced in our recent work, and this paper extends it to target speaker extraction. We therefore name the proposed approach iNeuBe-X, where the X stands for extraction. To address the challenges encountered in the CEC2 setting, we introduce four major novelties: (1) we extend the state-of-the-art TF-GridNet model, originally designed for monaural speaker separation, to multi-channel, causal speech enhancement, and observe large improvements when replacing the TCNDenseNet used in iNeuBe with this new architecture; (2) we leverage a recent dual window size approach with future-frame prediction to ensure that iNeuBe-X satisfies the 5 ms constraint on algorithmic latency required by CEC2; (3) we introduce a novel speaker-conditioning branch for TF-GridNet to achieve target speaker extraction; (4) we propose a fine-tuning step, in which we compute an additional loss with respect to the target speaker signal compensated with the listener audiogram. Without using external data, our best model reaches a hearing-aid speech perception index (HASPI) score of 0.942 and a scale-invariant signal-to-distortion ratio improvement (SI-SDRi) of 18.8 dB on the official development set. These results are promising given that the CEC2 data is extremely challenging (e.g., on the development set the mixture SI-SDR is -12.3 dB). A demo of our submitted system is available at WAVLab CEC2 demo.
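For readers unfamiliar with the SI-SDRi figure quoted above, the following is a minimal sketch of how SI-SDR and its improvement over the unprocessed mixture are commonly computed. It follows the standard textbook definition and is not taken from the challenge's evaluation code; the function names `si_sdr` and `si_sdr_improvement`, the `eps` stabilizer, and the single-channel 1-D input layout are assumptions made for illustration.

```python
import numpy as np


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB for 1-D signals (illustrative helper)."""
    # Remove the mean so a DC offset does not affect the score.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to find the optimal scaling.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = scale * target
    # Whatever is left after the projection counts as distortion.
    residual = estimate - projection
    ratio = (np.sum(projection ** 2) + eps) / (np.sum(residual ** 2) + eps)
    return 10.0 * np.log10(ratio)


def si_sdr_improvement(estimate, mixture, target):
    """SI-SDRi: SI-SDR of the estimate minus SI-SDR of the input mixture."""
    return si_sdr(estimate, target) - si_sdr(mixture, target)
```

Under this definition, the reported SI-SDRi of 18.8 dB is the gap between the SI-SDR of the enhanced output and the SI-SDR of the unprocessed mixture, both measured against the same target signal.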