Best Practices: Deep Learning for Natural Language Processing (Part II)

August 20, 2017 · 待字闺中 · Sebastian Ruder

(continued from Part I)

Attention

Attention is most commonly used in sequence-to-sequence models to attend to encoder states, but can also be used in any sequence model to look back at past states. Using attention, we obtain a context vector