Existing self-supervised pre-trained speech models offer an effective way to leverage massive unannotated corpora for building strong automatic speech recognition (ASR) systems. However, many current models are trained on clean speech from a single source and tend to perform poorly when noise is present at test time. Overcoming the adverse influence of noise is therefore crucial for real-world applications. In this work, we propose a novel training framework, called deHuBERT, for noise-reduction encoding inspired by H. Barlow's redundancy-reduction principle. The framework improves the HuBERT training algorithm by introducing auxiliary losses that drive the self- and cross-correlation matrices between pairs of noise-distorted embeddings towards the identity matrix, encouraging the model to produce noise-agnostic speech representations. With this method, we report improved robustness in noisy environments, including under unseen noises, without impairing performance on the clean set.
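To make the redundancy-reduction idea concrete, below is a minimal NumPy sketch of a Barlow-style auxiliary loss that pushes the cross-correlation matrix between two embedding views (e.g. two noise-distorted versions of the same utterance) towards the identity matrix. The function name, the `lambd` weight, and the single-loss formulation are illustrative assumptions, not the paper's exact implementation, which combines several such losses with the HuBERT objective.

```python
import numpy as np

def redundancy_reduction_loss(z1, z2, lambd=0.005):
    """Drive the cross-correlation matrix of two embedding views
    towards the identity matrix (Barlow redundancy reduction).

    z1, z2 : (batch, dim) embeddings of two noise-distorted views.
    lambd  : weight on the off-diagonal (decorrelation) term;
             the value here is an illustrative assumption.
    """
    # Standardize each feature dimension over the batch.
    z1 = (z1 - z1.mean(axis=0)) / (z1.std(axis=0) + 1e-8)
    z2 = (z2 - z2.mean(axis=0)) / (z2.std(axis=0) + 1e-8)

    n = z1.shape[0]
    c = (z1.T @ z2) / n  # (dim, dim) cross-correlation matrix

    # Diagonal terms should be 1 (views agree per feature) ...
    on_diag = ((np.diagonal(c) - 1.0) ** 2).sum()
    # ... and off-diagonal terms 0 (features are decorrelated).
    off_diag = (c ** 2).sum() - (np.diagonal(c) ** 2).sum()
    return on_diag + lambd * off_diag
```

When the two views already have identical, mutually uncorrelated features, the correlation matrix is the identity and the loss vanishes; correlated or view-inconsistent features raise it, so minimizing this term alongside the main objective encourages embeddings that ignore the injected noise.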