In this work, we show that a factored hybrid hidden Markov model (FH-HMM), defined without any phonetic state-tying, outperforms a state-of-the-art hybrid HMM. The factored hybrid HMM provides a link to transducer models in the way it models phonetic (label) context, while preserving the strict separation between the acoustic and language models of the hybrid HMM approach. Furthermore, we show that the factored hybrid model can be trained from scratch without using phonetic state-tying in any of the training steps. Our modeling approach enables triphone context while avoiding phonetic state-tying, by decomposing the output into locally normalized factored posteriors for monophones/HMM states in phoneme context. Experimental results are provided for Switchboard 300h and LibriSpeech. On the former task, we also show that by avoiding the phonetic state-tying step, the factored hybrid can take better advantage of regularization techniques during training, compared to a standard hybrid HMM whose phonetic state-tying is based on classification and regression trees (CART).
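The decomposition into locally normalized factored posteriors can be sketched as follows. This is a minimal illustration, not the paper's implementation: the feature dimension, phoneme inventory size, the random weight matrices, and the particular conditioning order (center, then left, then right context) are all placeholder assumptions; the point is only that three small softmax outputs, each normalized over monophones, replace a single CART-tied triphone softmax.

```python
import numpy as np

def softmax(z, axis=-1):
    """Locally normalized distribution over the last axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def one_hot(i, n):
    v = np.zeros(n)
    v[i] = 1.0
    return v

# Placeholder sizes: 42 phonemes, 256-dim acoustic encoding of one frame.
n_phon, dim = 42, 256
rng = np.random.default_rng(0)
x = rng.normal(size=dim)  # stand-in for a learned frame encoding

# Random stand-ins for the three factored output layers.
W_c = rng.normal(size=(dim, n_phon))                # center given x
W_l = rng.normal(size=(dim + n_phon, n_phon))       # left given center, x
W_r = rng.normal(size=(dim + 2 * n_phon, n_phon))   # right given center, left, x

# p(center | x): a plain monophone posterior, no state-tying involved.
p_c = softmax(x @ W_c)
c = int(p_c.argmax())

# p(left | center, x): conditioned on the center label via a one-hot embedding.
p_l = softmax(np.concatenate([x, one_hot(c, n_phon)]) @ W_l)
l = int(p_l.argmax())

# p(right | center, left, x): conditioned on both context labels.
p_r = softmax(np.concatenate([x, one_hot(c, n_phon), one_hot(l, n_phon)]) @ W_r)
r = int(p_r.argmax())

# The triphone posterior is the product of the three locally normalized
# factors, so the full phoneme-in-context label set never has to be tied.
p_triphone = p_c[c] * p_l[l] * p_r[r]
```

Because each factor is normalized over only `n_phon` monophone classes, the model covers all triphones without ever enumerating (or clustering) the combinatorial triphone label space.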