Stream fusion, also known as system combination, is a common technique in automatic speech recognition for traditional hybrid hidden Markov model approaches, yet remains mostly unexplored for modern deep neural network end-to-end model architectures. Here, we investigate various fusion techniques for the all-attention-based encoder-decoder architecture known as the transformer, striving for optimal fusion by examining different fusion levels in an example single-microphone setting with fusion of standard magnitude and phase features. We introduce a novel multi-encoder learning method that performs a weighted combination of two encoder-decoder multi-head attention outputs only during training. Then, employing only the magnitude-feature encoder at inference, we show consistent improvements on the Wall Street Journal (WSJ) task with a language model and on Librispeech, without any increase in runtime or parameters. Combining two such multi-encoder-trained models by a simple late fusion at inference, we achieve state-of-the-art performance for transformer-based models on WSJ, with a significant relative WER reduction of 19\% compared to the current benchmark approach.
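The core mechanism can be illustrated with a minimal sketch: during training, the attention outputs of the magnitude and phase branches are mixed with a weighting factor, while at inference only the magnitude branch is used, so no extra parameters or runtime cost remains. The function name, the fixed weight, and the use of plain numpy arrays are all illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def combine_attention(attn_mag, attn_phase, weight=0.5, training=True):
    """Weighted combination of two encoder-decoder attention outputs.

    attn_mag, attn_phase: attention outputs of the magnitude- and
    phase-feature encoder branches (same shape). The fixed weight of
    0.5 is a hypothetical choice for illustration only.
    """
    if training:
        # Training: mix both streams so the decoder learns from both.
        return weight * attn_mag + (1.0 - weight) * attn_phase
    # Inference: only the magnitude branch is evaluated, so runtime
    # and parameter count match a single-encoder model.
    return attn_mag
```

At inference the phase encoder is simply never invoked, which is why the method adds no cost at test time; the late fusion mentioned afterwards would instead combine the outputs of two such fully trained models.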