Training recurrent neural networks is known to be difficult when time dependencies become long. Consequently, training standard gated cells such as the gated recurrent unit (GRU) and the long short-term memory (LSTM) on benchmarks where long-term memory is required remains an arduous task. In this work, we show that although most classical networks have only one stable equilibrium at initialisation, learning on tasks that require long-term memory only occurs once the number of stable equilibria of the network increases, a property known as multistability. Multistability is often not easily attained by initially monostable networks, which makes learning long-term dependencies difficult. This insight leads to the design of a novel, general way to initialise the connectivity of any recurrent network through a procedure called "warmup", which improves the network's capability to learn arbitrarily long time dependencies. This initialisation procedure is designed to maximise the network's reachable multistability, i.e., the number of attractors within the network that can be reached through relevant input trajectories. Warming up is performed before training, using stochastic gradient descent on a specifically designed loss. We show on information restitution, sequence classification, and reinforcement learning benchmarks that warming up greatly improves recurrent neural network performance for multiple recurrent cell types, but sometimes impedes precision. We therefore introduce a parallel recurrent network structure with a partial warmup, which is shown to greatly improve learning of long-term dependencies in sequences while maintaining high levels of precision. This approach provides a general framework for improving the learning abilities of any recurrent cell type when long-term memory is required.
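The notion of multistability used above can be illustrated on a toy example. The sketch below is not the warmup procedure described in this work; it only counts the stable equilibria of a single hypothetical recurrent unit with update h <- tanh(w * h) under zero input. The function name `count_attractors`, the weight `w`, and the grid of initial states are all assumptions chosen for illustration.

```python
import numpy as np


def count_attractors(w, n_init=50, n_steps=2000, decimals=2):
    """Count the distinct states reached by iterating h <- tanh(w * h).

    Toy, one-dimensional stand-in for a recurrent cell driven by zero
    input (an illustrative assumption, not the paper's setup).
    """
    # Even number of grid points, so the initial states never include
    # h = 0 exactly (an unstable equilibrium when w > 1).
    h = np.linspace(-2.0, 2.0, n_init)

    # Let every initial state relax towards the attractor of its basin.
    for _ in range(n_steps):
        h = np.tanh(w * h)

    # Group the final states by rounding; adding 0.0 turns -0.0 into 0.0.
    return len(np.unique(np.round(h, decimals) + 0.0))
```

Under these assumptions, `count_attractors(0.5)` finds a single attractor at h = 0 (a monostable unit), while `count_attractors(2.0)` finds two, one positive and one negative (a bistable unit). Roughly speaking, a monostable unit forgets its initial state once the input is removed, whereas a bistable unit can retain one bit indefinitely.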