In this study, we investigate how well LSTM, ReLU, and GRU models generalize on counting tasks over long sequences. Previous theoretical work has established that RNNs with ReLU activation and LSTMs have the capacity to count when suitably configured, while GRUs have limitations that prevent correct counting over longer sequences. Despite this, and despite some positive empirical results for LSTMs on Dyck-1 languages, our experimental results show that LSTMs fail to learn correct counting behavior on sequences significantly longer than those seen in training. ReLU networks show much larger variance in behavior and, in most cases, worse generalization. Generalization to long sequences is empirically correlated with validation loss, but reliable long-sequence generalization appears not to be practically achievable with backpropagation and current training techniques. We demonstrate distinct failure modes for LSTMs, GRUs, and ReLU networks. In particular, we observe that neither the saturation of the activation functions in LSTMs nor the weight settings that ReLU networks require to generalize counting behavior are reached in standard training regimens. In summary, learning generalizable counting behavior is still an open problem, and we discuss potential approaches for further research.
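To make the theoretical capacity claim concrete, the following is a minimal sketch (our own illustration, not taken from the paper) of a hand-configured single-unit ReLU RNN that counts bracket depth in a Dyck-1 string; the weight choices are assumptions for illustration, showing the kind of weight setting that, as noted above, standard training does not reliably find:

```python
import numpy as np

# Hand-set ReLU RNN: h_t = ReLU(W_h h_{t-1} + W_x x_t), one hidden unit.
# "(" increments the count, ")" decrements it. ReLU clips the state at
# zero, which is harmless for Dyck-1, where valid prefixes never dip
# below depth 0.
def relu(v):
    return np.maximum(v, 0.0)

W_h = np.array([[1.0]])        # carry the running count forward unchanged
W_x = np.array([[1.0, -1.0]])  # one-hot "(" adds 1, one-hot ")" subtracts 1

def count_depth(sequence):
    h = np.zeros((1,))
    for token in sequence:
        x = np.array([1.0, 0.0]) if token == "(" else np.array([0.0, 1.0])
        h = relu(W_h @ h + W_x @ x)
    return h[0]

print(count_depth("((()"))  # 2.0: two unmatched open brackets remain
```

Because the recurrence is exactly linear in the count for these weights, this construction generalizes to arbitrary sequence lengths by design; the empirical question studied here is whether gradient-based training converges to such a solution.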