Speech pre-training has shown great success in learning useful and general latent representations from large-scale unlabeled data. Based on a well-designed self-supervised learning pattern, pre-trained models can serve many downstream speech tasks such as automatic speech recognition. To take full advantage of the labeled data in low-resource tasks, we present an improved pre-training method that introduces a supervision-enhanced acoustic unit (SEAU) pattern to intensify the expression of context information and reduce the training cost. Encoder representations extracted from the SEAU pattern are used to generate more representative target units for the HuBERT pre-training process. The proposed method, named SeHuBERT, achieves relative word error rate reductions of 10.5% and 4.9% compared with the standard HuBERT on a Turkmen speech recognition task with 500 hours and 100 hours of fine-tuning data, respectively. Extended to more languages and more data, SeHuBERT also achieves a relative word error rate reduction of approximately 10% at half of the training cost compared with HuBERT.
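As a minimal sketch (not the paper's code), the step of turning encoder representations into discrete target units for HuBERT-style masked prediction can be illustrated with k-means clustering of frame-level features; the function name, feature shapes, and number of units below are illustrative assumptions standing in for the SEAU encoder outputs described above.

    # Sketch: derive discrete HuBERT-style target units from frame-level
    # encoder representations by k-means clustering (assumed setup).
    import numpy as np
    from sklearn.cluster import KMeans

    def generate_target_units(frame_features: np.ndarray, n_units: int = 500) -> np.ndarray:
        """Cluster (T, D) frame features into one discrete unit ID per frame."""
        km = KMeans(n_clusters=n_units, random_state=0, n_init=10).fit(frame_features)
        return km.predict(frame_features)  # pseudo-labels used as pre-training targets

In such a setup, the cluster assignments play the role of the "more representative target units" mentioned above: each frame's unit ID becomes the prediction target at masked positions during pre-training.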