State space models have been shown to be effective at modeling long range dependencies, especially on sequence classification tasks. In this work we focus on autoregressive sequence modeling over English books, Github source code, and ArXiv mathematics articles. Based on recent developments around the effectiveness of gated activation functions, we propose a new layer named Gated State Space (GSS) and show that it trains significantly faster than the diagonal version of S4 (i.e., DSS) on TPUs, is fairly competitive with several well-tuned Transformer-based baselines, and exhibits zero-shot generalization to longer inputs while being straightforward to implement. Finally, we show that leveraging self-attention to model local dependencies improves the performance of GSS even further.
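To make the gated state-space idea concrete, below is a minimal JAX/Flax sketch of a gated layer built around a simplified diagonal state-space mixer. It is an illustration under assumed shapes and hyperparameters (`model_dim`, `bottleneck_dim`, `state_dim`, the `DiagonalSSM`/`GatedSSMLayer` names, and the per-channel kernel construction are all assumptions), not the authors' exact GSS implementation: the diagonal SSM is reduced to a learned decaying kernel applied as a causal FFT convolution, and the gating follows the generic "one branch gates the other" pattern the abstract refers to.

```python
# Hypothetical sketch of a gated state-space style layer in JAX/Flax.
# Shapes, names, and initializers are illustrative assumptions.
import jax.numpy as jnp
import flax.linen as nn


class DiagonalSSM(nn.Module):
    """Simplified per-channel diagonal state-space mixing along the sequence axis."""
    state_dim: int = 64

    @nn.compact
    def __call__(self, u):  # u: (batch, length, features)
        L, F = u.shape[1], u.shape[2]
        # Learned decays (diagonal state matrix) and input/output projections.
        log_decay = self.param("log_decay", nn.initializers.uniform(1.0), (self.state_dim,))
        B = self.param("B", nn.initializers.normal(0.02), (self.state_dim, F))
        C = self.param("C", nn.initializers.normal(0.02), (F, self.state_dim))
        decay = jnp.exp(-jnp.exp(log_decay))                     # values in (0, 1)
        powers = decay[None, :] ** jnp.arange(L)[:, None]        # (L, state_dim)
        # Per-channel convolution kernel: k[l, f] = sum_n C[f, n] * decay_n^l * B[n, f]
        k = jnp.einsum("fn,ln,nf->lf", C, powers, B)             # (L, F)
        # Causal convolution via FFT (zero-padded to avoid wrap-around).
        u_f = jnp.fft.rfft(u, n=2 * L, axis=1)
        k_f = jnp.fft.rfft(k, n=2 * L, axis=0)
        return jnp.fft.irfft(u_f * k_f[None], n=2 * L, axis=1)[:, :L]


class GatedSSMLayer(nn.Module):
    """Gated wrapper: one branch runs through the SSM, the other acts as a multiplicative gate."""
    model_dim: int = 512
    bottleneck_dim: int = 128  # assumed reduced dimension for the SSM branch

    @nn.compact
    def __call__(self, x):  # x: (batch, length, model_dim)
        residual = x
        x = nn.LayerNorm()(x)
        v = nn.gelu(nn.Dense(self.model_dim)(x))        # gating branch
        u = nn.gelu(nn.Dense(self.bottleneck_dim)(x))   # state-space branch at reduced width
        y = nn.Dense(self.model_dim)(DiagonalSSM()(u))  # mix along the sequence, project back
        return nn.Dense(self.model_dim)(y * v) + residual
```

The gating keeps the state-space mixing on a reduced-width branch while the full-width branch modulates its output elementwise, which is one plausible reading of why such layers remain cheap to train on long sequences.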