Attention is most commonly used in sequence-to-sequence models to attend to encoder states, but can also be used in any sequence model to look back at past states. Using attention, we obtain a context vector $c_i$ based on hidden states $s_1, \dots, s_m$ that can be used together with the current hidden state $h_i$ for prediction. The context vector $c_i$ at position $i$ is calculated as an average of the previous states weighted with the attention scores $a_i$:
$$c_i = \sum_j a_{ij} s_j \qquad a_i = \mathrm{softmax}(f_{att}(h_i, s_j))$$
The attention function $f_{att}(h_i, s_j)$ calculates an unnormalized alignment score between the current hidden state $h_i$ and the previous hidden state $s_j$. In the following, we will discuss four attention variants: i) additive attention, ii) multiplicative attention, iii) self-attention, and iv) key-value attention.
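To make the two equations above concrete, here is a minimal NumPy sketch of the weighted-average step. The alignment function is passed in as a black box; the dot product used in the example call is only a stand-in, since the actual choices for $f_{att}$ are discussed below:

```python
import numpy as np

def softmax(x):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(h_i, S, f_att):
    # S holds the states s_1, ..., s_m as rows; f_att returns one unnormalized score per s_j
    scores = np.array([f_att(h_i, s_j) for s_j in S])
    a_i = softmax(scores)        # attention weights
    c_i = a_i @ S                # weighted average of the states
    return c_i, a_i

# example call with a placeholder dot-product alignment
h_i, S = np.random.randn(16), np.random.randn(10, 16)
c_i, a_i = attend(h_i, S, lambda h, s: h @ s)
```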
Additive attention The original attention mechanism (Bahdanau et al., 2015) [15] uses a one-hidden layer feed-forward network to calculate the attention alignment:
$$f_{att}(h_i, s_j) = v_a^\top \tanh(W_a [h_i; s_j])$$
where $v_a$ and $W_a$ are learned attention parameters. Analogously, we can also use matrices $W_1$ and $W_2$ to learn separate transformations for $h_i$ and $s_j$ respectively, which are then summed:
$$f_{att}(h_i, s_j) = v_a^\top \tanh(W_1 h_i + W_2 s_j)$$
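A minimal NumPy sketch of this second (summed) formulation, with randomly initialized stand-ins for the learned parameters and arbitrary dimensions:

```python
import numpy as np

d_h, d_a = 16, 32                        # hidden and attention dimensions (arbitrary)
W_1 = np.random.randn(d_a, d_h)          # learned parameters (random stand-ins)
W_2 = np.random.randn(d_a, d_h)
v_a = np.random.randn(d_a)

def f_att_additive(h_i, s_j):
    # v_a^T tanh(W_1 h_i + W_2 s_j)
    return v_a @ np.tanh(W_1 @ h_i + W_2 @ s_j)
```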
Multiplicative attention Multiplicative attention (Luong et al., 2015) [16] simplifies the attention operation by calculating the following function:
$$f_{att}(h_i, s_j) = h_i^\top W_a s_j$$
Additive and multiplicative attention are similar in complexity, although multiplicative attention is faster and more space-efficient in practice as it can be implemented more efficiently using matrix multiplication. Both variants perform similarly for small dimensionality $d_h$ of the decoder states, but additive attention performs better for larger dimensions. One way to mitigate this is to scale $f_{att}(h_i, s_j)$ by $1/\sqrt{d_h}$ (Vaswani et al., 2017) [17].
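Both the multiplicative score and its scaled variant are one-liners; a sketch with a random stand-in for the learned matrix $W_a$:

```python
import numpy as np

d_h = 16                                 # decoder state dimensionality (arbitrary)
W_a = np.random.randn(d_h, d_h)          # learned parameter (random stand-in)

def f_att_multiplicative(h_i, s_j):
    # h_i^T W_a s_j
    return h_i @ W_a @ s_j

def f_att_scaled(h_i, s_j):
    # divide by sqrt(d_h) to keep scores well-behaved for larger dimensions
    return (h_i @ W_a @ s_j) / np.sqrt(d_h)
```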
Attention can not only be used to attend to encoder or previous hidden states, but also to obtain a distribution over other features, such as the word embeddings of a text, as used for reading comprehension (Kadlec et al., 2017) [37]. However, attention is not directly applicable to classification tasks that do not require additional information, such as sentiment analysis. In such models, the final hidden state of an LSTM or an aggregation function such as max pooling or averaging is often used to obtain a sentence representation.
Self-attention Without any additional information, however, we can still extract relevant aspects from the sentence by allowing it to attend to itself using self-attention (Lin et al., 2017) [18]. Self-attention, also called intra-attention, has been used successfully in a variety of tasks including reading comprehension (Cheng et al., 2016) [38], textual entailment (Parikh et al., 2016) [39], and abstractive summarization (Paulus et al., 2017) [40].
We can simplify additive attention to compute the unnormalized alignment score for each hidden state $h_i$:
$$f_{att}(h_i) = v_a^\top \tanh(W_a h_i)$$
In matrix form, for hidden states $H = [h_1, \dots, h_n]$ we can calculate the attention vector $a$ and the final sentence representation $c$ as follows:
$$a = \mathrm{softmax}(v_a \tanh(W_a H^\top)) \qquad c = H a^\top$$
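A NumPy sketch of this single-vector case, again with random stand-ins for the learned parameters and arbitrary sizes; here $H$ is stored with one hidden state per row, so the weighted sum $H a^\top$ becomes `a @ H`:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n, d_h, d_a = 10, 16, 32                 # sentence length, hidden size, attention size (arbitrary)
H = np.random.randn(n, d_h)              # hidden states h_1, ..., h_n as rows
W_a = np.random.randn(d_a, d_h)          # learned parameters (random stand-ins)
v_a = np.random.randn(d_a)

a = softmax(v_a @ np.tanh(W_a @ H.T))    # attention weights, shape (n,)
c = a @ H                                # sentence representation, i.e. the weighted sum H a^T
```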
Rather than only extracting one vector, we can perform several hops of attention by using a matrix $V_a$ instead of $v_a$, which allows us to extract an attention matrix $A$:
$$A = \mathrm{softmax}(V_a \tanh(W_a H^\top)) \qquad C = A H$$
In practice, we enforce the following orthogonality constraint, in the form of the squared Frobenius norm, to penalize redundancy and encourage diversity among the attention vectors:
$$\Omega = \| A A^\top - I \|_F^2$$
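The multi-hop version and its penalty only differ from the sketch above in the extra hop dimension $r$; a self-contained sketch with random stand-in parameters and an arbitrary choice of $r = 4$:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n, d_h, d_a, r = 10, 16, 32, 4           # r attention hops (all sizes arbitrary)
H = np.random.randn(n, d_h)              # hidden states as rows
W_a = np.random.randn(d_a, d_h)          # learned parameters (random stand-ins)
V_a = np.random.randn(r, d_a)            # matrix replacing the vector v_a

A = softmax(V_a @ np.tanh(W_a @ H.T))    # (r, n): one attention distribution per hop
C = A @ H                                # (r, d_h) sentence representation matrix

# squared Frobenius norm penalty added to the training loss
penalty = np.linalg.norm(A @ A.T - np.eye(r), ord='fro') ** 2
```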
A similar multi-head attention is also used by Vaswani et al. (2017).
Key-value attention Finally, key-value attention (Daniluk et al., 2017) [19] is a recent attention variant that separates form from function by keeping separate vectors for the attention calculation. It has also been found useful for different document modelling tasks (Liu & Lapata, 2017) [41]. Specifically, key-value attention splits each hidden vector $h_i$ into a key $k_i$ and a value $v_i$: $[k_i; v_i] = h_i$. The keys are used for calculating the attention distribution $a_i$ using additive attention:
$$a_i = \mathrm{softmax}(v_a^\top \tanh(W_1 [k_{i-L}; \dots; k_{i-1}] + (W_2 k_i) \mathbf{1}^\top))$$
where $L$ is the length of the attention window and $\mathbf{1}$ is a vector of ones. The values are then used to obtain the context representation $c_i$:
$$c_i = [v_{i-L}; \dots; v_{i-1}] a_i^\top$$
The context $c_i$ is used together with the current value $v_i$ for prediction.
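A NumPy sketch of key-value attention over a window of $L$ previous states; the halfway split point, the window length and all parameters are arbitrary stand-ins, and the outer product $(W_2 k_i)\mathbf{1}^\top$ is expressed via broadcasting:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d_h, d_a, L = 16, 32, 5                   # hidden size, attention size, window length (arbitrary)
d_k = d_h // 2                            # each h_i is split halfway into a key and a value

W_1 = np.random.randn(d_a, d_k)           # learned parameters (random stand-ins)
W_2 = np.random.randn(d_a, d_k)
v_a = np.random.randn(d_a)

H_prev = np.random.randn(L, d_h)          # the L previous hidden states
h_i = np.random.randn(d_h)                # the current hidden state
K_prev = H_prev[:, :d_k].T                # previous keys as columns, shape (d_k, L)
V_prev = H_prev[:, d_k:].T                # previous values as columns
k_i = h_i[:d_k]                           # key of the current state

# (W_2 k_i) 1^T is an outer product; broadcasting over the window does the same thing
scores = v_a @ np.tanh(W_1 @ K_prev + (W_2 @ k_i)[:, None])
a_i = softmax(scores)                     # attention distribution over the window
c_i = V_prev @ a_i                        # context representation from the values
```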
The optimization algorithm and scheme are often among the parts of the model that are used as-is and treated as a black box. Sometimes, even slight changes to the algorithm, e.g. reducing the $\beta_2$ value in Adam (Dozat & Manning, 2017) [50], can make a large difference to the optimization behaviour.
Optimization algorithm Adam (Kingma & Ba, 2015) [21] is one of the most popular and widely used optimization algorithms and often the go-to optimizer for NLP researchers. It is often thought that Adam clearly outperforms vanilla stochastic gradient descent (SGD). However, while it converges much faster than SGD, it has been observed that SGD with learning rate annealing slightly outperforms Adam (Wu et al., 2016). Recent work furthermore shows that SGD with properly tuned momentum outperforms Adam (Zhang et al., 2017) [42].
Optimization scheme While Adam internally tunes the learning rate for every parameter (Ruder, 2016) [22], we can explicitly use SGD-style annealing with Adam. In particular, we can perform learning rate annealing with restarts: We set a learning rate and train the model until convergence. We then halve the learning rate and restart by loading the previous best model. In Adam's case, this causes the optimizer to forget its per-parameter learning rates and start fresh. Denkowski & Neubig (2017) [23] show that Adam with 2 restarts and learning rate annealing is faster and performs better than SGD with annealing.
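The annealing-with-restarts scheme is easy to express in code. The following PyTorch sketch uses a toy model and a fixed step budget per stage as stand-ins for a real model and a proper convergence check; the number of stages, learning rates and budgets are illustrative only:

```python
import copy
import torch
import torch.nn as nn

# toy model and data as stand-ins for an actual NLP model and dev-set evaluation
model = nn.Linear(10, 1)
x, y = torch.randn(256, 10), torch.randn(256, 1)
dev_loss = lambda m: nn.functional.mse_loss(m(x), y).item()

lr = 1e-3
best_loss, best_state = float('inf'), copy.deepcopy(model.state_dict())

for stage in range(3):                   # Adam with 2 restarts = 3 annealing stages
    # a fresh optimizer resets Adam's per-parameter statistics
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for step in range(200):              # fixed budget standing in for "train until convergence"
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        optimizer.step()
    if dev_loss(model) < best_loss:      # keep track of the best model so far
        best_loss, best_state = dev_loss(model), copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)    # restart from the previous best model
    lr /= 2                              # halve the learning rate for the next stage
```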
Combining multiple models into an ensemble by averaging their predictions is a proven strategy to improve model performance. While predicting with an ensemble is expensive at test time, recent advances in distillation allow us to compress an expensive ensemble into a much smaller model (Hinton et al., 2015; Kuncoro et al., 2016; Kim & Rush, 2016) [24, 25, 26].
Ensembling is an important way to ensure that results are still reliable if the diversity of the evaluated models increases (Denkowski & Neubig, 2017). While ensembling different checkpoints of a model has been shown to be effective (Jean et al., 2015; Sennrich et al., 2016) [51, 52], it comes at the cost of model diversity. Cyclical learning rates can help to mitigate this effect (Huang et al., 2017) [53]. However, if resources are available, we prefer to ensemble multiple independently trained models to maximize model diversity.
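A minimal PyTorch sketch of prediction-averaging over independently trained models; the models here are untrained stand-ins, in practice they would be separately trained runs or checkpoints:

```python
import torch
import torch.nn as nn

# stand-ins for independently trained models that share the same output space
models = [nn.Linear(10, 5) for _ in range(3)]
x = torch.randn(4, 10)

with torch.no_grad():
    # average the predicted probability distributions of the ensemble members
    probs = torch.stack([m(x).softmax(dim=-1) for m in models]).mean(dim=0)
    predictions = probs.argmax(dim=-1)
```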
Rather than pre-defining or using off-the-shelf hyperparameters, simply tuning the hyperparameters of our model can yield significant improvements over baselines. Recent advances in Bayesian Optimization have made it an ideal tool for the black-box optimization of hyperparameters in neural networks (Snoek et al., 2012) [56] and far more efficient than the widely used grid search. Automatic tuning of hyperparameters of an LSTM has led to state-of-the-art results in language modeling, outperforming models that are far more complex (Melis et al., 2017).
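As one possible setup, the sketch below uses scikit-optimize's gp_minimize for Bayesian optimization over a hypothetical search space; the objective is a dummy function standing in for an actual training-and-evaluation run, and the search space bounds are illustrative:

```python
from skopt import gp_minimize            # scikit-optimize; assumed available
from skopt.space import Real, Integer

def objective(params):
    # stand-in for training a model with these hyperparameters and returning a dev-set metric
    lr, hidden_size, dropout = params
    return (lr - 1e-3) ** 2 + (dropout - 0.3) ** 2 + abs(hidden_size - 512) * 1e-6

search_space = [
    Real(1e-4, 1e-2, prior='log-uniform', name='lr'),
    Integer(128, 1024, name='hidden_size'),
    Real(0.0, 0.7, name='dropout'),
]

result = gp_minimize(objective, search_space, n_calls=30, random_state=0)
print(result.x, result.fun)              # best hyperparameters and their dev score
```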
Learning the initial state We generally initialize the initial LSTM states with a zero vector. Instead of fixing the initial state, we can learn it like any other parameter, which can improve performance and is also recommended by Hinton. Refer to this blog post for a Tensorflow implementation.
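As a rough PyTorch equivalent of the same idea, the following sketch registers the initial hidden and cell states as trainable parameters and expands them to the batch size on each forward pass:

```python
import torch
import torch.nn as nn

class LSTMWithLearnedInit(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        # trainable initial hidden and cell states instead of fixed zero vectors
        self.h0 = nn.Parameter(torch.zeros(1, 1, hidden_size))
        self.c0 = nn.Parameter(torch.zeros(1, 1, hidden_size))

    def forward(self, x):                        # x: (batch, time, input_size)
        batch_size = x.size(0)
        h0 = self.h0.expand(-1, batch_size, -1).contiguous()
        c0 = self.c0.expand(-1, batch_size, -1).contiguous()
        return self.lstm(x, (h0, c0))

outputs, _ = LSTMWithLearnedInit(10, 20)(torch.randn(4, 7, 10))
```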
Tying input and output embeddings Input and output embeddings account for the largest number of parameters in the LSTM model. If the LSTM predicts words as in language modelling, input and output parameters can be shared (Inan et al., 2016; Press & Wolf, 2017) [54, 55]. This is particularly useful on small datasets that do not allow learning a large number of parameters.
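In PyTorch, tying amounts to pointing the output layer at the embedding matrix, provided the embedding and output input dimensions match; a minimal sketch with hypothetical sizes:

```python
import torch.nn as nn

vocab_size, d_emb = 10000, 256           # hypothetical sizes

embedding = nn.Embedding(vocab_size, d_emb)
decoder = nn.Linear(d_emb, vocab_size, bias=False)
decoder.weight = embedding.weight        # input and output layers now share one parameter matrix
```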
Gradient norm clipping One way to decrease the risk of exploding gradients is to clip their maximum value (Mikolov, 2012) [57]. This, however, does not improve performance consistently (Reimers & Gurevych, 2017). Rather than clipping each gradient independently, clipping the global norm of the gradient (Pascanu et al., 2013) [58] yields more significant improvements (a Tensorflow implementation can be found here).
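A minimal PyTorch sketch of global-norm clipping, inserted between the backward pass and the optimizer step; the model, loss and threshold are arbitrary stand-ins:

```python
import torch
import torch.nn as nn

model = nn.LSTM(10, 20)                  # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

output, _ = model(torch.randn(5, 3, 10))
loss = output.pow(2).mean()              # dummy loss for illustration
loss.backward()

# rescale all gradients together so that their global norm is at most max_norm
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
optimizer.step()
```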
Down-projection To reduce the number of output parameters further, the hidden state of the LSTM can be projected to a smaller size. This is useful particularly for tasks with a large number of outputs, such as language modelling (Melis et al., 2017).
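A minimal PyTorch sketch of such a down-projection with hypothetical sizes; the output layer now scales with the projection size rather than the hidden size:

```python
import torch.nn as nn

vocab_size, d_emb, d_hidden, d_proj = 10000, 256, 1024, 256   # hypothetical sizes

embedding = nn.Embedding(vocab_size, d_emb)
lstm = nn.LSTM(d_emb, d_hidden, batch_first=True)
down_proj = nn.Linear(d_hidden, d_proj)  # project the LSTM output down before the output layer
decoder = nn.Linear(d_proj, vocab_size)  # vocab_size * d_proj weights instead of vocab_size * d_hidden
```

With the projection size chosen equal to the embedding size, as here, the down-projected output layer can additionally be tied to the input embeddings as described above.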