Pre-trained speech Transformers have facilitated great success across various speech processing tasks. However, fine-tuning these encoders for downstream tasks requires sufficiently large training data to converge or to achieve state-of-the-art performance. In the text domain, this has been partly attributed to the sub-optimality of the representation space in pre-trained Transformers. In this work, we take a sober look into pre-trained speech encoders and rewire their representation space without requiring any task-specific labels. Our method utilises a neutrally synthesised version of the audio inputs along with frame masking to construct positive pairs for contrastive self-supervised learning. When used to augment the wav2vec 2 encoder, we observe a consistent improvement in the isotropy of the representation space. Our experiments on 6 speech processing tasks exhibit a significant convergence speedup during task fine-tuning as well as consistent task improvement, especially in low-resource settings.
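To make the described rewiring step concrete, below is a minimal, hypothetical sketch of a contrastive objective of the kind the abstract outlines: the encoder outputs for an original recording and a neutrally synthesised rendering of the same content form a positive pair, frame masking serves as augmentation, and an InfoNCE-style loss contrasts each pair against other utterances in the batch. The helper names, masking ratio, pooling choice, and temperature are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F


def mask_frames(frames: torch.Tensor, mask_prob: float = 0.15) -> torch.Tensor:
    """Zero out a random subset of frames. frames: (batch, time, dim)."""
    keep = (torch.rand(frames.shape[:2], device=frames.device) > mask_prob).float()
    return frames * keep.unsqueeze(-1)


def info_nce(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """NT-Xent loss between two views of the same batch. z_*: (batch, dim)."""
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature            # (batch, batch) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, targets)


def rewiring_loss(encoder, wav_original, wav_synthesised):
    """Contrastive loss between an utterance and its neutrally synthesised version.

    `encoder` stands for any frame-level speech encoder (e.g. a wav2vec 2 model)
    mapping raw waveforms (batch, samples) to frame features (batch, time, dim).
    """
    frames_a = mask_frames(encoder(wav_original))     # view 1: original audio, frames masked
    frames_b = mask_frames(encoder(wav_synthesised))  # view 2: synthesised audio, frames masked
    # Mean-pool frames into utterance-level vectors before applying the contrastive loss.
    return info_nce(frames_a.mean(dim=1), frames_b.mean(dim=1))
```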