We introduce a self-supervised speech pre-training method called TERA, which stands for Transformer Encoder Representations from Alteration. Recent approaches often learn through a single auxiliary task such as contrastive prediction, autoregressive prediction, or masked reconstruction. Unlike previous methods, we use alteration along three orthogonal axes to pre-train Transformer Encoders on a large amount of unlabeled speech. The model learns by reconstructing acoustic frames from their altered counterparts, where a stochastic policy alters the input along three dimensions: time, frequency, and magnitude. TERA can be used for speech representation extraction or fine-tuning with downstream models. We evaluate TERA on several downstream tasks, including phoneme classification, keyword spotting, speaker recognition, and speech recognition. We present a large-scale comparison of various self-supervised models; TERA achieves strong performance in this comparison, improving upon surface features and outperforming previous models. In our experiments, we study the effect of applying different alteration techniques, pre-training on more data, and pre-training on various features. We analyze different model sizes and find that smaller models are stronger representation learners than larger models, while larger models are more effective for downstream fine-tuning. Furthermore, we show that the proposed method is transferable to downstream datasets not used in pre-training.
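To make the alteration step concrete, the sketch below applies a stochastic policy along the three axes named above (time, frequency, and magnitude) to a log-mel spectrogram. The block widths, probabilities, and noise scale are illustrative assumptions rather than the exact pre-training configuration; the model would then be trained to reconstruct the original frames from the altered input.

```python
import numpy as np

def alter_spectrogram(x, rng,
                      time_mask_ratio=0.15, time_block=7,
                      freq_block=8, mag_noise_prob=0.1, noise_std=0.2):
    """Stochastic alteration of a (T, F) log-mel spectrogram along three axes.
    All hyperparameters here are assumptions chosen for illustration."""
    x = x.copy()
    T, F = x.shape

    # Time alteration: zero out randomly placed blocks of consecutive frames.
    n_blocks = max(1, int(T * time_mask_ratio / time_block))
    for _ in range(n_blocks):
        t0 = rng.integers(0, max(1, T - time_block))
        x[t0:t0 + time_block, :] = 0.0

    # Frequency alteration: zero out one randomly placed block of frequency bins.
    f0 = rng.integers(0, max(1, F - freq_block))
    x[:, f0:f0 + freq_block] = 0.0

    # Magnitude alteration: add Gaussian noise to a random subset of frames.
    noisy = rng.random(T) < mag_noise_prob
    x[noisy, :] += rng.normal(0.0, noise_std, size=(int(noisy.sum()), F))

    return x

# Usage: the altered input is fed to the encoder, which reconstructs the clean frames.
clean = np.random.randn(400, 80).astype(np.float32)
altered = alter_spectrogram(clean, np.random.default_rng(0))
```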