In many sequential tasks, a model needs to remember relevant events from the distant past to make correct predictions. Unfortunately, a straightforward application of gradient-based training requires intermediate computations to be stored for every element of a sequence. This demands prohibitively large amounts of memory when a sequence consists of thousands or even millions of elements and, as a result, makes learning of very long-term dependencies infeasible. However, the majority of sequence elements can usually be predicted from temporally local information alone. At the same time, predictions affected by long-term dependencies are sparse and characterized by high uncertainty when only local information is available. We propose MemUP, a new training method that learns long-term dependencies without backpropagating gradients through the whole sequence at once. This method can potentially be applied to any gradient-based sequence learning approach. Our MemUP implementation for recurrent architectures performs better than or comparably to baselines while requiring significantly less memory for training.
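The sketch below illustrates the core idea stated above in PyTorch: process a long sequence chunk by chunk, detach the recurrent state between chunks so gradients never flow through the whole sequence, and train the memory only on the few steps whose local predictions are most uncertain. This is a minimal illustration under our own assumptions, not the paper's reference implementation; the module names, the uncertainty proxy (per-step local loss), and the in-chunk target selection are simplifications introduced here.

```python
# Hedged sketch of the idea described in the abstract (assumed components, not MemUP's exact algorithm).
import torch
import torch.nn as nn


class MemUPSketch(nn.Module):
    def __init__(self, in_dim, hid_dim, n_classes, top_k=2):
        super().__init__()
        self.rnn = nn.GRU(in_dim, hid_dim, batch_first=True)  # recurrent memory
        self.local = nn.Linear(in_dim, n_classes)              # predictor using only local input
        self.head = nn.Linear(hid_dim + in_dim, n_classes)     # memory-aided predictor
        self.top_k = top_k

    def forward_chunk(self, x, y, h):
        """x: (B, T, in_dim), y: (B, T) class targets, h: detached recurrent state or None."""
        # Local predictions use only the current inputs; their per-step loss
        # serves as a cheap uncertainty estimate.
        local_logits = self.local(x)
        per_step_loss = nn.functional.cross_entropy(
            local_logits.flatten(0, 1), y.flatten(), reduction="none"
        ).view_as(y.float())

        # Run the recurrent memory over the chunk; gradients stay inside the
        # chunk because the incoming state was detached by the caller.
        states, h_new = self.rnn(x, h)

        # Keep only the few most uncertain steps as prediction targets for memory.
        idx = per_step_loss.topk(self.top_k, dim=1).indices
        sel_states = torch.gather(states, 1, idx.unsqueeze(-1).expand(-1, -1, states.size(-1)))
        sel_inputs = torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, x.size(-1)))
        sel_targets = torch.gather(y, 1, idx)

        mem_logits = self.head(torch.cat([sel_states, sel_inputs], dim=-1))
        mem_loss = nn.functional.cross_entropy(mem_logits.flatten(0, 1), sel_targets.flatten())

        # Detach the state so backpropagation never spans more than one chunk.
        return per_step_loss.mean() + mem_loss, h_new.detach()
```

Because the state passed between chunks is detached, the memory requirement of a training step grows with the chunk length rather than with the full sequence length, which is the property the abstract emphasizes.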