Learning hierarchical structures in sequential data -- from simple algorithmic patterns to natural language -- in a reliable, generalizable way remains a challenging problem for neural language models. Past work has shown that recurrent neural networks (RNNs) struggle to generalize on held-out algorithmic or syntactic patterns without supervision or some inductive bias. To remedy this, many papers have explored augmenting RNNs with various differentiable stacks, by analogy with finite automata and pushdown automata (PDAs). In this paper, we improve the performance of our recently proposed Nondeterministic Stack RNN (NS-RNN), which uses a differentiable data structure that simulates a nondeterministic PDA, with two important changes. First, the model now assigns unnormalized positive weights instead of probabilities to stack actions, and we provide an analysis of why this improves training. Second, the model can directly observe the state of the underlying PDA. Our model achieves lower cross-entropy than all previous stack RNNs on five context-free language modeling tasks (within 0.05 nats of the information-theoretic lower bound), including a task on which the NS-RNN previously failed to outperform a deterministic stack RNN baseline. Finally, we propose a restricted version of the NS-RNN that incrementally processes infinitely long sequences, and we present language modeling results on the Penn Treebank.
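To make the first change concrete, here is a minimal illustrative sketch (not the authors' implementation) of how stack-action scores computed from an RNN hidden state could be turned into either normalized probabilities (as in the original NS-RNN) or unnormalized positive weights (as proposed here). The layer sizes, the flattened action space, and the use of `exp` for positivity are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class StackActionLayer(nn.Module):
    """Illustrative sketch: map an RNN hidden state to scores over
    PDA transitions (state/symbol pairs flattened into one axis).
    All dimensions here are hypothetical."""

    def __init__(self, hidden_size: int, num_states: int, num_symbols: int):
        super().__init__()
        # Simplified: one score per (state, symbol) -> (state, symbol) transition.
        num_actions = (num_states * num_symbols) ** 2
        self.scores = nn.Linear(hidden_size, num_actions)

    def forward(self, h: torch.Tensor):
        logits = self.scores(h)
        # Original NS-RNN: normalize scores into a probability distribution,
        # so increasing one action's weight necessarily decreases the others'.
        probs = torch.softmax(logits, dim=-1)
        # Proposed change: unnormalized positive weights (here via exp),
        # so each action's weight can be adjusted independently.
        weights = torch.exp(logits)
        return probs, weights
```

In this sketch, the gradient of one action's weight does not depend on suppressing the weights of competing actions, which is one intuition for why unnormalized weights can ease training; the paper's analysis should be consulted for the precise argument.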