Self-supervised learning (SSL) techniques have recently been applied to the monaural speech enhancement problem. Because most SSL methods do not exploit clean phase information, their enhancement performance is limited. In this paper, we therefore propose a phase-aware, self-supervised learning based monaural speech enhancement method. The latent representations of amplitude and phase are learned independently in two decoders of the foundation autoencoder (FAE), using only a limited set of clean speech signals. The downstream autoencoder (DAE) then learns a shared latent space between the clean speech and mixture representations from a large number of unseen mixtures. A complex-cycle-consistent (CCC) mechanism is proposed to minimize the reconstruction loss between the amplitude and phase domains. In addition, when the speech features are extracted as multi-resolution spectra, the desired information distributed across spectra of different scales can be exploited to further boost performance. The NOISEX and DAPS corpora are used to generate mixtures with different interferences to evaluate the efficacy of the proposed method. Note that the clean speech and the mixtures fed to the FAE and DAE are not paired. Both ablation and comparison experiments show that the proposed method clearly outperforms state-of-the-art approaches.
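To make the notion of multi-resolution amplitude and phase spectra concrete, the following is a minimal sketch and not the feature extractor used in this work. It computes short-time Fourier transforms at several window lengths and splits each into amplitude and phase. The function names `stft` and `multi_resolution_features`, the Hann window, the 50% hop, and the window lengths (256, 512, 1024) are all illustrative assumptions; the abstract does not specify them.

```python
import numpy as np

def stft(signal, win_length, hop_length):
    """Naive STFT with a Hann window; returns a complex (frames, bins) array."""
    window = np.hanning(win_length)
    n_frames = 1 + (len(signal) - win_length) // hop_length
    frames = np.stack([
        signal[i * hop_length : i * hop_length + win_length] * window
        for i in range(n_frames)
    ])
    return np.fft.rfft(frames, axis=1)

def multi_resolution_features(signal, win_lengths=(256, 512, 1024)):
    """Return one (amplitude, phase) pair per window length.

    The window lengths are hypothetical values chosen for illustration.
    """
    features = []
    for win in win_lengths:
        spec = stft(signal, win, win // 2)  # assumed 50% overlap
        features.append((np.abs(spec), np.angle(spec)))
    return features

# Usage: one second of a synthetic 440 Hz tone at 16 kHz stands in for speech.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
feats = multi_resolution_features(tone)
for amp, phase in feats:
    print(amp.shape, phase.shape)
```

Shorter windows give finer time resolution and longer windows give finer frequency resolution, which is one common motivation for combining spectra of several scales.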