For a state-of-the-art end-to-end ASR model to benefit from data-efficient multi-modal training, and in particular from the far larger pool of unpaired text data, two problems must be addressed: 1) the mismatch in feature sampling rates between speech and language (i.e., text) data; 2) the homogeneity of the representations learned by the two encoders. In this paper we propose a novel bidirectional attention mechanism (BiAM) to jointly train the ASR encoder (bottom layers) and a text encoder within a multi-modal learning framework. BiAM facilitates the exchange of feature sampling rates, so that the quality of features transformed from one modality can be measured in the space of the other, under a set of diversified objective functions. As a result, the speech representations are enriched with more linguistic information, while the representations generated by the text encoder become more similar to their corresponding speech representations, making the shared ASR model more amenable to pretraining on unpaired text data. To validate the efficacy of the proposed method, we conduct two categories of experiments, with and without extra unpaired text data. Experimental results on the Librispeech corpus show that the method achieves up to 6.15% word error rate reduction (WERR) when trained on paired data only, and 9.23% WERR when additional unpaired text data is employed.
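To make the sampling-rate exchange concrete, the sketch below illustrates one plausible reading of a bidirectional attention block: text queries attend over speech features to produce acoustic information at the text sampling rate, and speech queries attend over text features to produce linguistic information at the speech sampling rate. This is a minimal illustration, not the paper's implementation; all module and parameter names (BiAttention, dim, num_heads) are assumptions, and standard multi-head scaled-dot-product attention stands in for whatever attention variant the paper uses.

import torch
import torch.nn as nn

class BiAttention(nn.Module):
    """Minimal sketch of a bidirectional cross-attention block (assumed design,
    not the paper's BiAM): two cross-attention directions between a speech
    feature sequence and a text embedding sequence of different lengths."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        # The two directions share no weights here; the paper's BiAM may
        # tie or structure them differently.
        self.speech_to_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.text_to_speech = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, speech: torch.Tensor, text: torch.Tensor):
        # speech: (B, T_s, dim) acoustic encoder output over T_s frames
        # text:   (B, T_t, dim) text encoder output over T_t tokens, T_s != T_t
        # Text queries over speech keys/values: a T_t-length sequence carrying
        # acoustic information at the text sampling rate.
        text_rate_feats, _ = self.speech_to_text(query=text, key=speech, value=speech)
        # Speech queries over text keys/values: a T_s-length sequence carrying
        # linguistic information at the speech sampling rate.
        speech_rate_feats, _ = self.text_to_speech(query=speech, key=text, value=text)
        return text_rate_feats, speech_rate_feats

if __name__ == "__main__":
    biam = BiAttention(dim=256)
    speech = torch.randn(2, 120, 256)  # 120 acoustic frames
    text = torch.randn(2, 30, 256)     # 30 text tokens
    t_feats, s_feats = biam(speech, text)
    print(t_feats.shape, s_feats.shape)  # (2, 30, 256), (2, 120, 256)

Because each transformed sequence lives at the other modality's sampling rate, a consistency loss against that modality's own representation is one natural choice among the diversified objective functions mentioned above; the paper's actual objectives are not specified in this abstract.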