When using recurrent neural networks (RNNs), it is common practice to apply trained models to sequences longer than those seen in training. This "extrapolating" usage deviates from the traditional statistical learning setup, where guarantees are provided under the assumption that train and test distributions are identical. Here we set out to understand when RNNs can extrapolate, focusing on a simple case where the data generating distribution is memoryless. We first show that even with infinite training data, there exist RNN models that interpolate perfectly (i.e., they fit the training data) yet extrapolate poorly to longer sequences. We then show that if gradient descent is used for training, learning will converge to perfect extrapolation under certain assumptions on initialization. Our results complement recent studies on the implicit bias of gradient descent, showing that it plays a key role in extrapolation when learning temporal prediction models.
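The first claim above (interpolating models that fail to extrapolate) can be illustrated with a toy construction. This is a hypothetical sketch, not the paper's actual proof: a linear RNN with a nilpotent shift recurrence whose readout taps the hidden state exactly `T_TRAIN` steps back. On training-length sequences that tap is never reached, so the model fits the memoryless target y = x_T perfectly; on longer sequences a stale input leaks into the output.

```python
import numpy as np

T_TRAIN = 3  # training sequence length (illustrative choice, not from the paper)

def rnn_predict(A, b, c, xs):
    """Linear RNN: h_t = A h_{t-1} + b * x_t, output c . h_T."""
    h = np.zeros(A.shape[0])
    for x in xs:
        h = A @ h + b * x
    return float(c @ h)

# "Good" interpolator: zero recurrence, so the output is always the last
# input -- it matches the memoryless target y = x_T at every length.
A_good = np.zeros((1, 1))
b_good = np.ones(1)
c_good = np.ones(1)

# "Bad" interpolator: a sub-diagonal shift matrix (nilpotent) plus a readout
# that also reads the state component reached only after T_TRAIN + 1 steps.
A_bad = np.diag(np.ones(T_TRAIN), k=-1)          # (T_TRAIN+1) x (T_TRAIN+1) shift
b_bad = np.eye(T_TRAIN + 1)[0]                   # write input into component 0
c_bad = np.eye(T_TRAIN + 1)[0] + np.eye(T_TRAIN + 1)[T_TRAIN]  # extra stale tap

xs_train = [0.5, -1.0, 2.0]        # length 3: both models output x_T = 2.0
xs_long = [0.5, -1.0, 2.0, 3.0]    # length 4: target is x_T = 3.0, but the
                                   # bad model adds the stale x_1 = 0.5
```

Both models achieve zero error on every length-3 sequence, yet on the length-4 input the bad model outputs x_4 + x_1 = 3.5 instead of 3.0, which is exactly the interpolate-but-not-extrapolate behavior the abstract describes.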