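[Attention Mechanisms] A Big Collection of Efficient Improvements to Attention

August 26, 2020 · 深度学习自然语言处理 (Deep Learning and NLP)

Editor: NewBeeNLP

A few days ago, while browsing GitHub, I came across the "awesome-fast-attention" list, which collects a series of papers on efficient improvements to attention, along with each paper's citation count, source-code implementation, algorithmic complexity, and key highlights. We have already covered some of these papers in our earlier "Transformer Assemble" series.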

Efficient Attention

Paper (citations)	Implementation	Main Idea
Generating Wikipedia by Summarizing Long Sequences^[1] (208)	memory-compressed-attention^[2]	compresses key and value + blocked attention
CBAM: Convolutional Block Attention Module^[3] (677)	attention-module^[4]	combines the SE attention with a per-pixel (local) weight
CCNet: Criss-Cross Attention for Semantic Segmentation^[5] (149)	CCNet^[6]	each pixel attends to its row and column simultaneously
Efficient Attention: Attention with Linear Complexities^[7] (2)	efficient-attention^[8]	Softmax(Q)(Softmax(K^T)V)
Star-Transformer^[9] (24)	fastNLP^[10]	uses a relay (global) node and attends to/from that node
Generating Long Sequences with Sparse Transformers^[11] (139)	torch-blocksparse^[12]	sparse block based attention
GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond^[13] (96)	GCNet^[14]	squeeze and excitation with an attention pooling (instead of a GAP)
SCRAM: Spatially Coherent Randomized Attention Maps^[15] (1)	-	uses PatchMatch to find close keys
Interlaced Sparse Self-Attention for Semantic Segmentation^[16] (13)	IN_PAPER	combines short-range attention followed by long-range (dilated) attention
Permutohedral Attention Module for Efficient Non-Local Neural Networks^[17] (2)	Permutohedral_attention_module^[18]	uses permutohedral lattice approximation algorithm to approximate the attention output
Large Memory Layers with Product Keys^[19] (28)	XLM^[20]	search for nearest neighbor keys
Expectation-Maximization Attention Networks for Semantic Segmentation^[21] (38)	EMANet^[22]	applies expectation maximization to cluster keys into k clusters
Compressive Transformers for Long-Range Sequence Modelling^[23] (20)	compressive-transformer-pytorch^[24]	compresses distant tokens instead of just stop_grad()-ing them; a more efficient version of Transformer-XL
BP-Transformer: Modelling Long-Range Context via Binary Partitioning^[25] (8)	BPT^[26]	attends to distant tokens coarsely and attends to close tokens in a more fine-grained manner
Axial Attention in Multidimensional Transformers^[27] (5)	axial-attention^[28]	apply attention on each axis separately
Reformer: The Efficient Transformer^[29] (69)	trax^[30]	uses LSH to find close keys
Transformer on a Diet^[31] (2)	transformer-on-diet^[32]	dilated transformer like wavenet
Sparse Sinkhorn Attention^[33] (4)	sinkhorn-transformer^[34]	uses a cost matrix to limit attention between buckets
SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection^[35] (1)	-	learns the q, k connections == dynamically creates a sparse attention matrix
Efficient Content-Based Sparse Attention with Routing Transformers^[36] (11)	routing-transformer^[37]	computes attention with same-cluster tokens (computed by online k-means)
Longformer: The Long-Document Transformer^[38] (15)	longformer^[39]	global + blocked attention
Neural Architecture Search for Lightweight Non-Local Networks^[40] (2)	AutoNL^[41]	computes Q(KV) and also downsamples q, k, v in both the spatial and channel dimensions
ETC: Encoding Long and Structured Data in Transformers^[42] (2)	-	combines global attention (star transformer with multiple global tokens) with local attention
Multi-scale Transformer Language Models^[43] (1)	IN_PAPER	UNet-like + retina attention, which is close to BP-Transformer
Synthesizer: Rethinking Self-Attention in Transformer Models^[44] (5)	-	does not compute pairwise interactions
Jukebox: A Generative Model for Music^[45] (9)	jukebox^[46]	better attention patterns from Sparse Transformer
GMAT: Global Memory Augmentation for Transformers^[47] (0)	gmat^[48]	adds global tokens
Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers^[49] (0)	google-research^[50]	calculate an unbiased stochastic approximation of the attention matrix
Hand-crafted Attention is All You Need? A Study of Attention on Self-supervised Audio Transformer^[51] (0)	-	does not compute pairwise interactions and uses fixed mask patterns
Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention^[52] (1)	fast-transformers^[53]	uses phi(q)(phi(k)v) and also improves the sequential sampling step
Linformer: Self-Attention with Linear Complexity^[54] (3)	linformer-pytorch^[55]	projects key and value from n×d to k×d
Real-time Semantic Segmentation with Fast Attention^[56] (0)	-	l2_norm(q)(l2_norm(k)v)
Fast Transformers with Clustered Attention^[57] (0)	fast-transformers^[58]	groups queries together with LSH
Big Bird: Transformers for Longer Sequences^[59] (0)	-	ETC with random connections
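
Several rows in the table (Efficient Attention, Transformers are RNNs, Real-time Semantic Segmentation with Fast Attention) share one idea: replace softmax(QK^T)V with a product of the form phi(Q)(phi(K)^T V), so that the n×n attention matrix is never built and the cost grows linearly with sequence length n. The NumPy sketch below is only an illustration of that reordering, not the authors' code. The function names are hypothetical, the single-head unbatched shapes are an assumption, and elu(x)+1 is used as the feature map for the "Transformers are RNNs" variant (without the causal masking that paper also covers).

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def elu_plus_one(x):
    """Feature map elu(x) + 1: x + 1 for x > 0, exp(x) otherwise (always positive)."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def efficient_attention(q, k, v):
    """Softmax(Q)(Softmax(K)^T V) for q, k of shape (n, d) and v of shape (n, d_v)."""
    q_n = softmax(q, axis=-1)  # normalize each query over its d features
    k_n = softmax(k, axis=0)   # normalize each key feature over the n positions
    context = k_n.T @ v        # (d, d_v) summary of all keys and values
    return q_n @ context       # (n, d_v); no (n, n) matrix is formed


def linear_attention(q, k, v):
    """phi(Q)(phi(K)^T V) with row normalization, non-causal case."""
    fq, fk = elu_plus_one(q), elu_plus_one(k)
    kv = fk.T @ v              # (d, d_v)
    z = fk.sum(axis=0)         # (d,) normalizer shared by all queries
    return (fq @ kv) / (fq @ z)[:, None]


def quadratic_reference(fq, fk, v, normalize):
    """Same result computed the slow way, through an explicit (n, n) matrix."""
    a = fq @ fk.T
    if normalize:
        a = a / a.sum(axis=1, keepdims=True)
    return a @ v


rng = np.random.default_rng(0)
n, d, d_v = 128, 16, 8
q = rng.normal(size=(n, d))
k = rng.normal(size=(n, d))
v = rng.normal(size=(n, d_v))

out_eff = efficient_attention(q, k, v)
out_lin = linear_attention(q, k, v)
```

Both functions cost O(n·d·d_v) instead of O(n²·d), because matrix multiplication is associative and the small (d, d_v) product is computed first. `quadratic_reference` is there only to check that the reordered computation matches the explicit attention matrix.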

Articles

A Survey of Long-Term Context in Transformers ^[60]
Transformers Assemble（PART I）
Transformers Assemble（PART II）
Transformers Assemble（PART III）
Transformers Assemble（PART IV）
Transformers Assemble（PART V）
ICLR 2020 | Depth-Adaptive Transformer
Memory Transformer: a simple, straightforward modification of the Transformer
[ICLR 2020] Transformer Complex-order: a new way of encoding position

References

[1] Generating Wikipedia by Summarizing Long Sequences: https://arxiv.org/abs/1801.10198v1
[2] memory-compressed-attention: https://github.com/lucidrains/memory-compressed-attention
[3] CBAM: Convolutional Block Attention Module: https://arxiv.org/abs/1807.06521v2
[4] attention-module: https://github.com/Jongchan/attention-module
[5] CCNet: Criss-Cross Attention for Semantic Segmentation: https://arxiv.org/abs/1811.11721v2
[6] CCNet: https://github.com/speedinghzl/CCNet
[7] Efficient Attention: Attention with Linear Complexities: https://arxiv.org/abs/1812.01243v8
[8] efficient-attention: https://github.com/cmsflash/efficient-attention
[9] Star-Transformer: https://arxiv.org/abs/1902.09113v2
[10] fastNLP: https://github.com/fastnlp/fastNLP/blob/master/fastNLP/modules/encoder/star_transformer.py
[11] Generating Long Sequences with Sparse Transformers: https://arxiv.org/abs/1904.10509v1
[12] torch-blocksparse: https://github.com/ptillet/torch-blocksparse
[13] GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond: https://arxiv.org/abs/1904.11492v1
[14] GCNet: https://github.com/xvjiarui/GCNet
[15] SCRAM: Spatially Coherent Randomized Attention Maps: https://arxiv.org/abs/1905.10308v1
[16] Interlaced Sparse Self-Attention for Semantic Segmentation: https://arxiv.org/abs/1907.12273v2
[17] Permutohedral Attention Module for Efficient Non-Local Neural Networks: https://arxiv.org/abs/1907.00641v2
[18] Permutohedral_attention_module: https://github.com/SamuelJoutard/Permutohedral_attention_module
[19] Large Memory Layers with Product Keys: https://arxiv.org/abs/1907.05242v2
[20] XLM: https://github.com/facebookresearch/XLM
[21] Expectation-Maximization Attention Networks for Semantic Segmentation: https://arxiv.org/abs/1907.13426v2
[22] EMANet: https://github.com/XiaLiPKU/EMANet
[23] Compressive Transformers for Long-Range Sequence Modelling: https://arxiv.org/abs/1911.05507v1
[24] compressive-transformer-pytorch: https://github.com/lucidrains/compressive-transformer-pytorch
[25] BP-Transformer: Modelling Long-Range Context via Binary Partitioning: https://arxiv.org/abs/1911.04070v1
[26] BPT: https://github.com/yzh119/BPT
[27] Axial Attention in Multidimensional Transformers: https://arxiv.org/abs/1912.12180v1
[28] axial-attention: https://github.com/lucidrains/axial-attention
[29] Reformer: The Efficient Transformer: https://arxiv.org/abs/2001.04451v2
[30] trax: https://github.com/google/trax/tree/master/trax/models/reformer
[31] Transformer on a Diet: https://arxiv.org/abs/2002.06170v1
[32] transformer-on-diet: https://github.com/cgraywang/transformer-on-diet
[33] Sparse Sinkhorn Attention: https://arxiv.org/abs/2002.11296v1
[34] sinkhorn-transformer: https://github.com/lucidrains/sinkhorn-transformer
[35] SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection: https://arxiv.org/abs/2003.09833v2
[36] Efficient Content-Based Sparse Attention with Routing Transformers: https://arxiv.org/abs/2003.05997v1
[37] routing-transformer: https://github.com/lucidrains/routing-transformer
[38] Longformer: The Long-Document Transformer: https://arxiv.org/abs/2004.05150v1
[39] longformer: https://github.com/allenai/longformer
[40] Neural Architecture Search for Lightweight Non-Local Networks: https://arxiv.org/abs/2004.01961v1
[41] AutoNL: https://github.com/LiYingwei/AutoNL
[42] ETC: Encoding Long and Structured Data in Transformers: https://arxiv.org/abs/2004.08483v2
[43] Multi-scale Transformer Language Models: https://arxiv.org/abs/2005.00581v1
[44] Synthesizer: Rethinking Self-Attention in Transformer Models: https://arxiv.org/abs/2005.00743v1
[45] Jukebox: A Generative Model for Music: https://arxiv.org/abs/2005.00341v1
[46] jukebox: https://github.com/openai/jukebox
[47] GMAT: Global Memory Augmentation for Transformers: https://arxiv.org/abs/2006.03274v1
[48] gmat: https://github.com/ag1988/gmat
[49] Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers: https://arxiv.org/abs/2006.03555v1
[50] google-research: https://github.com/google-research/google-research/tree/master/performer/fast_self_attention
[51] Hand-crafted Attention is All You Need? A Study of Attention on Self-supervised Audio Transformer: https://arxiv.org/abs/2006.05174v1
[52] Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention: https://arxiv.org/abs/2006.16236v2
[53] fast-transformers: https://github.com/idiap/fast-transformers
[54] Linformer: Self-Attention with Linear Complexity: https://arxiv.org/abs/2006.04768v3
[55] linformer-pytorch: https://github.com/tatp22/linformer-pytorch
[56] Real-time Semantic Segmentation with Fast Attention: https://arxiv.org/abs/2007.03815v2
[57] Fast Transformers with Clustered Attention: https://arxiv.org/abs/2007.04825v1
[58] fast-transformers: https://github.com/idiap/fast-transformers
[59] Big Bird: Transformers for Longer Sequences: https://arxiv.org/abs/2007.14062v1
[60] A Survey of Long-Term Context in Transformers: https://www.pragmatic.ml/a-survey-of-methods-for-incorporating-long-term-context/

