In this note we examine the autoregressive generalization of the FNet algorithm, in which self-attention layers from the standard Transformer architecture are substituted with a trivial sparse-uniform sampling procedure based on Fourier transforms. Using the Wikitext-103 benchmark, we demonstrate that FNetAR retains state-of-the-art performance (25.8 ppl) on the task of causal language modeling compared to a Transformer-XL baseline (24.2 ppl) with only half the number of self-attention layers, thus providing further evidence for the superfluity of deep neural networks with heavily compounded attention mechanisms. The autoregressive Fourier transform could likely be used for parameter reduction on most Transformer-based time-series prediction models.
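The core idea described above is to replace self-attention with Fourier-transform token mixing and adapt it to the causal (autoregressive) setting. The sketch below shows standard FNet-style mixing and one plausible way to restrict it to the causal prefix; the function names and the prefix-FFT scheme are illustrative assumptions, not the exact FNetAR procedure.

```python
import torch
import torch.fft


def fnet_mixing(x):
    # Standard (non-causal) FNet token mixing: 2D FFT over the hidden and
    # sequence dimensions, keeping only the real part.
    return torch.fft.fft(torch.fft.fft(x, dim=-1), dim=-2).real


def causal_fourier_mixing(x):
    # Hypothetical causal variant: position t mixes only over the prefix
    # x[..., :t+1, :], so no information flows from future tokens.
    # Naive O(L^2 log L) loop for clarity; an actual implementation would
    # amortize this across the sequence.
    seq_len = x.shape[-2]
    out = torch.empty_like(x)
    for t in range(seq_len):
        prefix = x[..., : t + 1, :]
        mixed = torch.fft.fft(torch.fft.fft(prefix, dim=-1), dim=-2).real
        out[..., t, :] = mixed[..., -1, :]
    return out


# Example: batch of 2 sequences, length 8, hidden size 16
x = torch.randn(2, 8, 16)
print(causal_fourier_mixing(x).shape)  # torch.Size([2, 8, 16])
```

In a hybrid model such as the one evaluated against Transformer-XL, a mixing layer of this kind would stand in for roughly half of the self-attention layers, with the remaining layers kept as standard attention.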