Common practice when using recurrent neural networks (RNNs) is to apply a model to sequences longer than those seen in training. This "extrapolating" usage deviates from the traditional statistical learning setup, where guarantees are provided under the assumption that the train and test distributions are identical. Here we set out to understand when RNNs can extrapolate, focusing on a simple case where the data-generating distribution is memoryless. We first show that even with infinite training data, there exist RNN models that interpolate perfectly (i.e., they fit the training data) yet extrapolate poorly to longer sequences. We then show that if gradient descent is used for training, learning will converge to perfect extrapolation under certain assumptions on the initialization. Our results complement recent studies on the implicit bias of gradient descent, showing that it plays a key role in extrapolation when learning temporal prediction models.
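The existence claim — a model that fits all short sequences exactly yet fails on longer ones — can be illustrated with a toy linear RNN. The construction below is our own minimal sketch, not the paper's: a shift register of length T with a small unstable feedback unit that is only reached once an input is older than T steps, so the model matches the memoryless target y_t = x_t on every sequence of length at most T but accumulates growing error beyond that.

```python
import numpy as np

def run_rnn(A, b, c, xs):
    """Linear RNN: h_t = A h_{t-1} + b x_t,  y_t = c @ h_t, with h_0 = 0."""
    h = np.zeros(A.shape[0])
    ys = []
    for x in xs:
        h = A @ h + b * x
        ys.append(c @ h)
    return np.array(ys)

T = 6                        # "training" sequence length (illustrative choice)
n = T + 1
A = np.zeros((n, n))
for i in range(n - 1):
    A[i + 1, i] = 1.0        # shift register: unit i feeds unit i+1
A[n - 1, n - 1] = 1.5        # unstable feedback on the last unit
b = np.zeros(n); b[0] = 1.0  # input enters the first unit
c = np.zeros(n); c[0] = 1.0  # output reads the first unit...
c[n - 1] = 0.1               # ...plus a small leak from the unstable unit

rng = np.random.default_rng(0)
short = rng.standard_normal(T)       # within the "training" length
long_ = rng.standard_normal(5 * T)   # longer than anything seen

# Target is memoryless: y_t = x_t.
err_short = np.max(np.abs(run_rnn(A, b, c, short) - short))
err_long = np.max(np.abs(run_rnn(A, b, c, long_) - long_))
print(err_short, err_long)  # exact fit on short sequences, large error on long ones
```

Since c @ A^k b = 0 for k = 1, ..., T-1, the unstable unit is invisible on sequences of length up to T (perfect interpolation), while on longer sequences its 1.5-factor growth dominates the output.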