We introduce Attention Free Transformer (AFT), an efficient variant of Transformers that eliminates the need for dot product self attention. In an AFT layer, the key and value are first combined with a set of learned position biases, the result of which is multiplied with the query in an element-wise fashion. This new operation has a memory complexity linear w.r.t. both the context size and the dimension of features, making it compatible with both large input and model sizes. We also introduce AFT-local and AFT-conv, two model variants that take advantage of the idea of locality and spatial weight sharing while maintaining global connectivity. We conduct extensive experiments on two autoregressive modeling tasks (CIFAR10 and Enwik8) as well as an image recognition task (ImageNet-1K classification). We show that AFT demonstrates competitive performance on all the benchmarks, while providing excellent efficiency.
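To make the described operation concrete, the following is a minimal NumPy sketch of a single AFT layer (the full, non-causal variant): keys are combined with a learned pairwise position bias, the weighted values are normalized, and the result is gated element-wise by a sigmoid of the query. The function name `aft_full`, the per-sequence shapes, and the omission of causal masking and multi-head details are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def aft_full(Q, K, V, w):
    """Sketch of an AFT-full layer for a single sequence.

    Q, K, V: (T, d) query / key / value matrices.
    w:       (T, T) learned pairwise position biases.
    Returns: (T, d) output.
    """
    T, d = Q.shape
    out = np.empty_like(V)
    for t in range(T):
        # Combine keys with the position biases for target position t.
        logits = K + w[t][:, None]                   # (T, d)
        logits -= logits.max(axis=0, keepdims=True)  # numerical stability
        weights = np.exp(logits)                     # (T, d)
        num = (weights * V).sum(axis=0)              # weighted sum of values, (d,)
        den = weights.sum(axis=0)                    # normalizer, (d,)
        # Element-wise gating by the query.
        out[t] = sigmoid(Q[t]) * (num / den)
    return out

# Example usage with random inputs.
T, d = 8, 16
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))
w = rng.standard_normal((T, T)) * 0.1
Y = aft_full(Q, K, V, w)  # (T, d)
```

Note that no T x T attention map over the feature dimension is ever materialized; only per-dimension sums over the context are kept, which is what gives the linear memory complexity mentioned above.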