Contemporary wisdom based on empirical studies suggests that standard recurrent neural networks (RNNs) do not perform well on tasks requiring long-term memory. However, a precise explanation for this behavior has remained elusive. This paper provides a rigorous explanation of this property in the special case of linear RNNs. Although this work is limited to linear RNNs, even these systems have traditionally been difficult to analyze due to their non-linear parameterization. Using recently developed kernel regime analysis, our main result shows that linear RNNs learned from random initializations are functionally equivalent to a certain weighted 1D-convolutional network. Importantly, the weightings in the equivalent model induce an implicit bias toward elements with smaller time lags in the convolution, and hence toward shorter memory. The degree of this bias depends on the variance of the transition kernel matrix at initialization and is related to the classic exploding and vanishing gradients problem. The theory is validated in both synthetic and real data experiments.
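To make the claimed equivalence concrete, the following is a minimal numerical sketch; it is not the paper's code, and the generic notation h_t = W h_{t-1} + U x_t, y_T = c^T h_T is an assumption rather than the paper's formulation. Unrolling a linear RNN from a zero initial state expresses its final output as a 1D convolution of the input with the kernel k_j = c^T W^j U, and for a small-variance initialization of W these kernel entries shrink with the time lag j, which is the short-memory bias described above.

```python
import numpy as np

# Minimal sketch (not the paper's code): a single-output linear RNN
#   h_t = W h_{t-1} + U x_t,   y_T = c^T h_T,   h_0 = 0
# unrolls into a 1D convolution over the input with kernel entries
#   k_j = c^T W^j U   (time lag j), so y_T = sum_j k_j x_{T-1-j}.

rng = np.random.default_rng(0)
d, T = 8, 20                                          # hidden size, sequence length
W = rng.normal(scale=0.3 / np.sqrt(d), size=(d, d))   # transition matrix (small variance)
U = rng.normal(size=(d, 1))                           # input weights (scalar input)
c = rng.normal(size=(d,))                             # readout weights
x = rng.normal(size=(T,))                             # input sequence

# Run the recurrence directly.
h = np.zeros(d)
for t in range(T):
    h = W @ h + U[:, 0] * x[t]
y_rnn = c @ h

# Equivalent convolutional form: y_T = sum_{j=0}^{T-1} (c^T W^j U) x_{T-1-j}.
kernel = np.array([c @ np.linalg.matrix_power(W, j) @ U[:, 0] for j in range(T)])
y_conv = kernel @ x[::-1]

print(np.allclose(y_rnn, y_conv))   # True: the two forms agree
print(np.abs(kernel))               # magnitudes shrink with the time lag j here
```

In this sketch the kernel decays geometrically because the spectral radius of W is well below one; a larger initialization variance slows that decay, mirroring the abstract's statement that the strength of the short-memory bias is governed by the variance of the transition matrix at initialization.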