In many sequential tasks, a model needs to remember relevant events from the distant past in order to make correct predictions. Unfortunately, a straightforward application of gradient-based training requires intermediate computations to be stored for every element of a sequence. This becomes prohibitively expensive when a sequence consists of thousands or even millions of elements and, as a result, makes learning very long-term dependencies infeasible. However, the majority of sequence elements can usually be predicted from temporally local information alone. On the other hand, predictions affected by long-term dependencies are sparse and characterized by high uncertainty when only local information is available. We propose MemUP, a new training method that learns long-term dependencies without backpropagating gradients through the whole sequence at once. The method can potentially be applied to any recurrent architecture. An LSTM network trained with MemUP performs better than or comparably to baselines while storing less intermediate data.
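To make the memory argument concrete, below is a minimal sketch (assuming PyTorch, with illustrative toy shapes and a hypothetical `segment_len`) of standard truncated backpropagation through time. It shows why full backpropagation must retain activations for every timestep and how detaching the recurrent state between segments bounds the stored graph at the cost of cutting gradients across segment boundaries; this is the baseline trade-off the abstract refers to, not the MemUP algorithm itself.

```python
# Generic truncated-BPTT sketch (not MemUP): detaching the hidden state between
# segments keeps the autograd graph, and hence the stored intermediate data,
# limited to one segment instead of the whole sequence.
import torch
import torch.nn as nn

seq_len, segment_len, batch, dim = 10_000, 100, 8, 32   # illustrative sizes
lstm = nn.LSTM(dim, dim, batch_first=True)
head = nn.Linear(dim, 1)
opt = torch.optim.Adam(list(lstm.parameters()) + list(head.parameters()))

x = torch.randn(batch, seq_len, dim)   # toy input sequence
y = torch.randn(batch, seq_len, 1)     # toy per-step targets

h = None
for start in range(0, seq_len, segment_len):
    seg_x = x[:, start:start + segment_len]
    seg_y = y[:, start:start + segment_len]
    out, h = lstm(seg_x, h)
    loss = nn.functional.mse_loss(head(out), seg_y)
    opt.zero_grad()
    loss.backward()                    # graph covers only the current segment
    opt.step()
    # Detach so the next segment does not retain the previous segment's graph;
    # memory stays bounded, but no gradient flows across segment boundaries,
    # which is exactly why long-term dependencies are hard to learn this way.
    h = tuple(s.detach() for s in h)
```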