Transformers are expensive to train due to the quadratic time and space complexity of the self-attention mechanism. On the other hand, although kernel machines suffer from the same computational bottleneck in pairwise dot products, several approximation schemes have been successfully incorporated to considerably reduce their computational cost without sacrificing much accuracy. In this work, we leverage computation methods developed for kernel machines to alleviate this high cost and introduce Skyformer, which replaces the softmax structure with a Gaussian kernel to stabilize model training and adapts the Nystr\"om method to a non-positive semidefinite matrix to accelerate the computation. We further provide a theoretical analysis showing that the matrix approximation error of the proposed method is small in spectral norm. Experiments on the Long Range Arena benchmark show that the proposed method achieves comparable or even better performance than full self-attention while requiring fewer computational resources.
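To make the core idea concrete, the following is a minimal NumPy sketch of a plain Nystr\"om approximation of a Gaussian (RBF) kernel matrix, the standard positive-semidefinite setting. It is an illustrative toy rather than the Skyformer implementation (which adapts Nystr\"om to a non-positive semidefinite matrix); the function names, the uniform landmark sampling, and the bandwidth choice are our own assumptions.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    # Pairwise squared Euclidean distances, then the Gaussian (RBF) kernel.
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    return np.exp(-sq_dists / (2.0 * sigma**2))

def nystrom_approx(X, num_landmarks=32, sigma=1.0, seed=None):
    # Nystrom approximation of the n x n kernel matrix K(X, X):
    #   K  ~=  C  W^+  C.T,
    # where C = K(X, landmarks) is n x m and W = K(landmarks, landmarks) is m x m.
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    idx = rng.choice(n, size=min(num_landmarks, n), replace=False)
    landmarks = X[idx]
    C = gaussian_kernel(X, landmarks, sigma)          # n x m
    W = gaussian_kernel(landmarks, landmarks, sigma)  # m x m
    # Formed densely here only for the error check below; in practice one keeps
    # the three factors to stay in O(n * m) time and memory.
    return C @ np.linalg.pinv(W) @ C.T

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((512, 64))                # e.g. 512 tokens, head dim 64
    K_exact = gaussian_kernel(X, X)
    K_approx = nystrom_approx(X, num_landmarks=64, seed=0)
    err = np.linalg.norm(K_exact - K_approx, ord=2) / np.linalg.norm(K_exact, ord=2)
    print(f"relative spectral-norm error: {err:.3f}")
```

The sketch reports the relative error in spectral norm, the same metric used in the paper's approximation guarantee, though the bound there concerns the adapted, non-PSD setting rather than this textbook case.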