We introduce a new approach for speech pre-training named SPIRAL, which learns denoising representations of perturbed data in a teacher-student framework. Specifically, given a speech utterance, we first feed the utterance to a teacher network to obtain the corresponding representation. Then the same utterance is perturbed and fed to a student network. The student network is trained to output a representation resembling that of the teacher. At the same time, the teacher network is updated as a moving average of the student's weights over training steps. To prevent representation collapse, we apply an in-utterance contrastive loss as the pre-training objective and impose position randomization on the input to the teacher. SPIRAL achieves competitive or better results compared to the state-of-the-art speech pre-training method wav2vec 2.0, with a significant reduction in training cost (80% for the Base model, 65% for the Large model). Furthermore, we address noise robustness, which is critical to real-world speech applications. We propose multi-condition pre-training, which perturbs the student's input with various types of additive noise. We demonstrate that multi-condition pre-trained SPIRAL models are more robust to noisy speech (9.0%-13.3% relative word error rate reduction on real noisy test data) compared to applying multi-condition training only in the fine-tuning stage. The code will be released after publication.
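As a concrete illustration of this training loop, the following is a minimal PyTorch sketch of one SPIRAL-style pre-training step. The encoder, the `perturb` function, and all hyperparameter values are illustrative assumptions rather than the paper's implementation; position randomization of the teacher input is noted in a comment but omitted for brevity.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of one SPIRAL-style pre-training step, assuming a generic
# frame-level encoder mapping (time, feat) -> (time, dim). All names here
# (perturb, ema_update, temperature values) are illustrative placeholders.

def perturb(x, noise_scale=0.05):
    # Placeholder perturbation: additive Gaussian noise. Multi-condition
    # pre-training would instead mix in various types of real noise.
    return x + noise_scale * torch.randn_like(x)

@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    # Teacher weights track a moving average of the student's weights.
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(decay).add_(s_p, alpha=1.0 - decay)

def in_utterance_contrastive_loss(student_out, teacher_out, temperature=0.1):
    # For each frame, the positive is the teacher frame at the same position;
    # the other frames of the *same* utterance act as negatives.
    s = F.normalize(student_out, dim=-1)   # (T, D)
    t = F.normalize(teacher_out, dim=-1)   # (T, D)
    logits = s @ t.T / temperature         # (T, T) frame-similarity matrix
    targets = torch.arange(s.size(0))      # positive is the aligned frame
    return F.cross_entropy(logits, targets)

def pretrain_step(student, teacher, utterance, optimizer):
    with torch.no_grad():
        # The paper additionally applies position randomization (e.g. random
        # padding) to the teacher input to prevent collapse; omitted here.
        teacher_out = teacher(utterance)
    student_out = student(perturb(utterance))
    loss = in_utterance_contrastive_loss(student_out, teacher_out)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(teacher, student)
    return loss.item()

# Toy usage: a linear layer stands in for the real speech encoder.
student = nn.Linear(80, 256)
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
utterance = torch.randn(120, 80)  # (T, feat) frames of one utterance
print(pretrain_step(student, teacher, utterance, optimizer))
```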