Fraternal Dropout

From arXiv. Accepted to ICLR 2018. Extended appendix. Added official GitHub code for replication: https://github.com/kondiz/fraternal-dropout. Added references. Corrected typos.

Recurrent neural networks (RNNs) are an important class of neural network architectures, useful for language modeling and sequential prediction. However, optimizing RNNs is known to be harder than optimizing feed-forward neural networks. A number of techniques have been proposed in the literature to address this problem. In this paper we propose a simple technique called fraternal dropout that takes advantage of dropout to achieve this goal. Specifically, we propose to train two identical copies of an RNN (that share parameters) with different dropout masks while minimizing the difference between their (pre-softmax) predictions. In this way our regularization encourages the representations of RNNs to be invariant to the dropout mask, and thus robust. We show that our regularization term is upper bounded by the expectation-linear dropout objective, which has been shown to address the gap between the training and inference phases of dropout. We evaluate our model and achieve state-of-the-art results in sequence modeling tasks on two benchmark datasets, Penn Treebank and WikiText-2. We also show that our approach improves performance by a significant margin in image captioning (Microsoft COCO) and semi-supervised (CIFAR-10) tasks.
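The training objective described in the abstract can be illustrated with a minimal sketch in plain Python. Everything here is an assumption made for illustration and is not taken from the official code: a toy linear classifier stands in for the RNN, dropout is applied to the input only, the "difference" between the two pre-softmax predictions is measured as a mean squared difference, and the names `dropout_mask`, `logits`, `cross_entropy`, `fraternal_dropout_loss`, `kappa` and `keep_prob` are hypothetical. Only the loss is computed; no gradient step is shown.

```python
import math
import random


def dropout_mask(n, keep_prob, rng):
    """Sample an inverted-dropout mask: each unit is kept with probability keep_prob."""
    return [(1.0 / keep_prob) if rng.random() < keep_prob else 0.0 for _ in range(n)]


def logits(weights, x, mask):
    """Pre-softmax predictions of a toy linear model applied to the masked input."""
    dropped = [xi * mi for xi, mi in zip(x, mask)]
    return [sum(w * d for w, d in zip(row, dropped)) for row in weights]


def cross_entropy(z, target):
    """Softmax cross-entropy of logits z against an integer class label."""
    m = max(z)
    log_norm = m + math.log(sum(math.exp(v - m) for v in z))
    return log_norm - z[target]


def fraternal_dropout_loss(weights, x, target, kappa, keep_prob, rng):
    """Average target loss of two masked passes plus a penalty on their logit gap."""
    # Two forward passes through the SAME weights with independent dropout masks.
    z1 = logits(weights, x, dropout_mask(len(x), keep_prob, rng))
    z2 = logits(weights, x, dropout_mask(len(x), keep_prob, rng))
    target_loss = 0.5 * (cross_entropy(z1, target) + cross_entropy(z2, target))
    # Assumed form of the penalty: mean squared difference of pre-softmax outputs.
    gap = sum((a - b) ** 2 for a, b in zip(z1, z2)) / len(z1)
    return target_loss + kappa * gap, gap


rng = random.Random(0)
weights = [[0.5, -0.2, 0.1, 0.3], [-0.4, 0.6, 0.2, -0.1], [0.1, 0.1, -0.3, 0.4]]
x = [1.0, 2.0, -1.0, 0.5]
loss, gap = fraternal_dropout_loss(weights, x, target=1, kappa=0.1, keep_prob=0.7, rng=rng)
print(f"loss={loss:.4f} gap={gap:.4f}")
```

When `keep_prob` is 1.0 both masks are identical, the gap term vanishes, and the loss reduces to the ordinary cross-entropy. In this sketch the penalty only becomes active when the two masks disagree, which is how it pushes the predictions toward being invariant to the mask.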
