This paper presents a novel natural gradient and Hessian-free (NGHF) optimisation framework for neural network training that can operate efficiently in a distributed manner. It relies on the linear conjugate gradient (CG) algorithm to combine the natural gradient (NG) method with local curvature information from Hessian-free (HF) or other second-order methods. A solution to a numerical issue in CG allows effective parameter updates to be generated with far fewer CG iterations than are usually required (e.g. 5-8 instead of 200). This work also presents a novel preconditioning approach that improves the progress made by individual CG iterations for models with shared parameters. Although applicable to other training losses and model structures, NGHF is investigated in this paper for lattice-based discriminative sequence training of hybrid hidden Markov model acoustic models, using standard recurrent neural network, long short-term memory, and time delay neural network models for output probability calculation. Automatic speech recognition experiments are reported on the multi-genre broadcast data set for a range of different acoustic model types. These experiments show that NGHF achieves larger word error rate reductions than standard stochastic gradient descent or Adam, while requiring orders of magnitude fewer parameter updates.
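To make the role of the linear CG solver concrete, the following is a minimal sketch of a truncated, preconditioned CG loop that computes an update direction from a gradient and a curvature-matrix-vector product (e.g. a damped Gauss-Newton or Fisher-matrix product). It is not the paper's distributed implementation; the function names, the toy curvature matrix, and the fixed iteration budget are illustrative assumptions only, chosen to mirror the small number of CG iterations (e.g. 5-8) mentioned above.

```python
import numpy as np

def truncated_pcg(grad, Avp, precond=None, max_iters=8, tol=1e-10):
    """Approximately solve A d = -grad with preconditioned linear CG.

    A is accessed only through the matrix-vector product Avp(v), which in
    NG/HF-style training would be a Fisher or Gauss-Newton product computed
    over a mini-batch.  The loop is truncated after a few iterations, as in
    the abstract's description of short CG runs (a sketch, not the paper's
    actual algorithm)."""
    d = np.zeros_like(grad)            # current update direction
    r = -grad                          # residual -grad - A d, with d = 0
    z = precond(r) if precond else r   # apply preconditioner if supplied
    p = z.copy()                       # initial search direction
    rz = r @ z
    for _ in range(max_iters):
        Ap = Avp(p)
        alpha = rz / (p @ Ap)          # step length along p
        d += alpha * p
        r -= alpha * Ap
        if r @ r < tol:                # residual small enough: stop early
            break
        z = precond(r) if precond else r
        rz_new = r @ z
        p = z + (rz_new / rz) * p      # conjugate search-direction update
        rz = rz_new
    return d

# Toy usage with a synthetic positive-definite stand-in for the curvature.
rng = np.random.default_rng(0)
M = rng.standard_normal((10, 10))
A = M @ M.T + 10.0 * np.eye(10)        # damped, positive-definite matrix
g = rng.standard_normal(10)            # stand-in gradient vector
update = truncated_pcg(g, lambda v: A @ v, max_iters=8)
```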