Convolutional models have been widely used in multiple domains. However, most existing models use only local convolution, which prevents them from handling long-range dependencies efficiently. Attention overcomes this problem by aggregating global information, but its computational complexity is quadratic in the sequence length. Recently, Gu et al. [2021] proposed S4, a model inspired by the state space model. S4 can be efficiently implemented as a global convolutional model whose kernel size equals the input sequence length. S4 can model much longer sequences than Transformers and achieves significant gains over the state of the art (SoTA) on several long-range tasks. Despite its empirical success, S4 is intricate: it requires sophisticated parameterization and initialization schemes. As a result, S4 is less intuitive and hard to use. Here we aim to demystify S4 and extract the basic principles that contribute to its success as a global convolutional model. We focus on the structure of the convolution kernel and identify two critical but intuitive principles enjoyed by S4 that are sufficient to make up an effective global convolutional model: 1) The parameterization of the convolutional kernel needs to be efficient, in the sense that the number of parameters should scale sub-linearly with the sequence length. 2) The kernel needs to have a decaying structure, in which the weights for convolving with closer neighbors are larger than those for more distant ones. Based on these two principles, we propose a simple yet effective convolutional model called Structured Global Convolution (SGConv). SGConv exhibits strong empirical performance on several tasks: 1) While running faster, SGConv surpasses S4 on the Long Range Arena and Speech Command datasets. 2) When plugged into standard language and vision models, SGConv shows the potential to improve both efficiency and performance.
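The two principles can be made concrete with a small NumPy sketch. This is an illustration only, not the authors' implementation: the multi-scale construction, the linear interpolation, the `decay` factor, and the names `build_decaying_kernel` and `global_conv` are all assumptions chosen to satisfy the two stated properties. The kernel is assembled from a few short parameter vectors that are stretched to cover progressively longer spans (so the parameter count grows roughly logarithmically with the sequence length) and scaled down with distance (so nearer positions receive larger weights). The convolution itself is computed with the FFT, one common way to apply a kernel as long as the input.

```python
import numpy as np


def build_decaying_kernel(params, seq_len, decay=0.5):
    """Build a length-`seq_len` kernel from a small parameter array.

    Hypothetical construction (an assumption, not taken from the paper text):
    `params` has shape (num_scales, d). Sub-kernel i is stretched by linear
    interpolation to d * 2**max(i - 1, 0) taps and scaled by decay**i, so
    later (more distant) segments are longer but carry smaller weights.
    """
    num_scales, d = params.shape
    pieces = []
    for i in range(num_scales):
        length = d * 2 ** max(i - 1, 0)
        # Stretch d learned values to `length` taps.
        positions = np.linspace(0.0, d - 1.0, length)
        piece = np.interp(positions, np.arange(d), params[i])
        pieces.append((decay ** i) * piece)
    kernel = np.concatenate(pieces)[:seq_len]
    # Zero-pad if the concatenated segments are shorter than the sequence.
    if kernel.shape[0] < seq_len:
        kernel = np.pad(kernel, (0, seq_len - kernel.shape[0]))
    return kernel


def global_conv(x, kernel):
    """Causal convolution of a 1-D signal with an equally long kernel.

    Uses zero-padded FFTs so the circular convolution equals the linear one;
    the cost is O(L log L) instead of O(L^2) for direct convolution.
    """
    seq_len = x.shape[0]
    n = 2 * seq_len
    y = np.fft.irfft(np.fft.rfft(x, n=n) * np.fft.rfft(kernel, n=n), n=n)
    return y[:seq_len]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seq_len, d, num_scales = 64, 4, 5  # segment lengths 4 + 4 + 8 + 16 + 32 = 64
    params = rng.standard_normal((num_scales, d))
    kernel = build_decaying_kernel(params, seq_len)
    x = rng.standard_normal(seq_len)
    y = global_conv(x, kernel)
    print("parameters:", params.size, "kernel length:", kernel.shape[0])
    print("output shape:", y.shape)
```

In this toy setting a kernel covering 64 positions is described by 20 parameters, and doubling the sequence length adds only one more row of `d` parameters. A trainable version would treat `params` as learned weights and apply one such kernel per channel; those details are left out here.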