Self-supervised representations have been studied extensively for discriminative and generative tasks, but their robustness has received far less attention. This work focuses on self-supervised representations for generative spoken language modeling. First, we empirically demonstrate that current state-of-the-art speech representation models lack robustness to basic signal variations that do not alter the spoken information. To overcome this, we propose an effective and efficient method for learning robust self-supervised speech representations for generative spoken language modeling. The approach applies a set of signal transformations to the speech signal and optimizes the model with an iterative pseudo-labeling scheme. Our method significantly improves over the evaluated baselines on encoding metrics. We additionally evaluate our method on speech-to-speech translation, considering Spanish-English and French-English conversions, and empirically demonstrate the benefits of the proposed approach.
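As an illustration of what robustness to signal variations could mean in practice, the sketch below compares the discrete unit sequence of an utterance with that of a transformed copy. This is a minimal sketch under stated assumptions, not the metric defined in this work: the abstract does not specify the encoding metrics, so the choice of a length-normalized edit distance over de-duplicated units, the helper names (`dedup`, `levenshtein`, `normalized_unit_distance`), and the example unit sequences are all hypothetical.

```python
from itertools import groupby


def dedup(units):
    """Collapse consecutive repeats, e.g. [3, 3, 7, 7, 7, 2] -> [3, 7, 2]."""
    return [u for u, _ in groupby(units)]


def levenshtein(a, b):
    """Dynamic-programming edit distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute x with y (free if equal)
            ))
        prev = curr
    return prev[-1]


def normalized_unit_distance(clean_units, augmented_units):
    """Edit distance between de-duplicated unit sequences, divided by the
    length of the de-duplicated clean sequence (assumed normalization)."""
    a, b = dedup(clean_units), dedup(augmented_units)
    if not a:
        return 0.0 if not b else float("inf")
    return levenshtein(a, b) / len(a)


# Hypothetical unit sequences: the second stands in for the units of the same
# utterance after a transformation that leaves the spoken content unchanged.
clean = [3, 3, 7, 7, 7, 2, 2, 9]
augmented = [3, 7, 7, 5, 2, 2, 2, 9, 9]
score = normalized_unit_distance(clean, augmented)
print(f"normalized unit distance: {score:.2f}")
```

Under this assumed measure, a perfectly robust encoder would give a score of 0 for any transformation that preserves the spoken content, and larger scores indicate that the discrete units drift with the signal variation.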