Segmental models are sequence prediction models in which scores of hypotheses are based on entire variable-length segments of frames. We consider segmental models for whole-word ("acoustic-to-word") speech recognition, with the feature vectors defined using vector embeddings of segments. Such models are computationally challenging as the number of paths is proportional to the vocabulary size, which can be orders of magnitude larger than when using subword units like phones. We describe an efficient approach for end-to-end whole-word segmental models, with forward-backward and Viterbi decoding performed on a GPU and a simple segment scoring function that reduces space complexity. In addition, we investigate the use of pre-training via jointly trained acoustic word embeddings (AWEs) and acoustically grounded word embeddings (AGWEs) of written word labels. We find that word error rate can be reduced by a large margin by pre-training the acoustic segment representation with AWEs, and additional (smaller) gains can be obtained by pre-training the word prediction layer with AGWEs. Our final models improve over prior A2W models.
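To make the idea concrete, the following is a minimal sketch of segmental Viterbi decoding: a dynamic program over frame positions where each hypothesis scores entire variable-length segments against word embeddings. The mean-pooled segment embedding and dot-product scorer here are illustrative stand-ins, not the paper's actual learned acoustic word embedding model; `max_dur` and the function name are assumptions for this sketch.

```python
import numpy as np

def segmental_viterbi(frames, word_embs, max_dur):
    """Viterbi decoding over variable-length segments (illustrative sketch).

    frames:    (T, d) array of acoustic frames.
    word_embs: (V, d) array, one embedding per vocabulary word.
    max_dur:   maximum segment duration in frames.

    Segment score: dot product between a mean-pooled segment embedding and
    each word embedding (a stand-in for a learned segment scoring function).
    Returns the best path as a list of (start, end, word_id) segments.
    """
    T = frames.shape[0]
    best = np.full(T + 1, -np.inf)  # best[t]: best score over segmentations of frames[:t]
    best[0] = 0.0
    back = [None] * (T + 1)         # backpointers: (segment start, word id)
    for t in range(1, T + 1):
        for s in range(max(0, t - max_dur), t):
            seg_emb = frames[s:t].mean(axis=0)  # pooled segment embedding
            scores = word_embs @ seg_emb        # score the segment against every word
            w = int(np.argmax(scores))
            cand = best[s] + scores[w]
            if cand > best[t]:
                best[t] = cand
                back[t] = (s, w)
    # Trace back the best segmentation.
    path, t = [], T
    while t > 0:
        s, w = back[t]
        path.append((s, t, w))
        t = s
    return path[::-1]
```

Note that the inner loop ranges over all (word, segment) pairs, which is why whole-word vocabularies make decoding expensive; in practice the per-segment scores are computed in batched form on a GPU, as the abstract describes.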