Recurrent neural networks (RNNs) with a gating mechanism, such as the LSTM or GRU, are powerful tools for modeling sequential data. In this mechanism, the forget gate, originally introduced to control information flow in the RNN's hidden state, has recently been re-interpreted as a representative of the time scale of the state, i.e., a measure of how long the RNN retains information about its inputs. On the basis of this interpretation, several parameter initialization methods have been proposed that exploit prior knowledge of temporal dependencies in the data to improve learnability. However, this interpretation relies on several unrealistic assumptions, such as that no inputs arrive after a certain time point. In this work, we reconsider the interpretation of the forget gate in a more realistic setting. We first generalize the existing theory on gated RNNs so that it covers the case where inputs are given successively. We then argue that the interpretation of the forget gate as a temporal representation is valid when the gradient of the loss with respect to the state decreases exponentially going backward in time. We empirically demonstrate that existing RNNs satisfy this gradient condition in the initial training phase on several tasks, which is in good agreement with previous initialization methods. On the basis of this finding, we propose an approach to constructing new RNNs that can represent longer time scales than conventional models, which improves learnability on long-term sequential data. We verify the effectiveness of our method through experiments on real-world datasets.
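As a rough illustration of the time-scale interpretation referred to above (a minimal sketch, not this paper's derivation, assuming an LSTM-style cell state $c_t$, a constant forget-gate activation $f \in (0,1)$, and the standard simplifying assumption that no inputs arrive after time $t$), the state decays geometrically and yields a characteristic retention time:

\[
c_{t+k} \approx f^{\,k}\, c_t = e^{k \ln f}\, c_t ,
\qquad
T \approx -\frac{1}{\ln f} \approx \frac{1}{1-f} \quad \text{as } f \to 1 .
\]

Under this reading, a forget gate close to one corresponds to a long retention time, which is the property the initialization methods mentioned above exploit and which this work re-examines in the setting where inputs continue to arrive.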