We introduce DropDim, a structured dropout method designed for regularizing the self-attention mechanism, which is a key component of the Transformer. In contrast to the general dropout method, which randomly drops neurons, DropDim drops part of the embedding dimensions. In this way, the semantic information carried by the dropped dimensions is discarded completely. Thus, excessive co-adaptation between different embedding dimensions is broken, and the self-attention is forced to encode meaningful features with a certain number of embedding dimensions erased. Experiments on a wide range of tasks executed on the MUST-C English-German dataset show that DropDim can effectively improve model performance, reduce over-fitting, and show complementary effects with other regularization methods. When combined with label smoothing, the WER can be reduced from 19.1% to 15.1% on the ASR task, and the BLEU score can be increased from 26.90 to 28.38 on the MT task. On the ST task, the model reaches a BLEU score of 22.99, an increase of 1.86 BLEU points over the strong baseline.
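To make the idea concrete, the sketch below shows one way such dimension-level dropout could be implemented in PyTorch. The function name `drop_dim`, the choice to share the mask across the batch and sequence axes, and the inverse-probability rescaling are illustrative assumptions, not the authors' reference implementation.

```python
import torch


def drop_dim(x: torch.Tensor, p: float = 0.1, training: bool = True) -> torch.Tensor:
    """Zero out entire embedding dimensions of x with shape (batch, seq_len, d_model).

    Unlike standard element-wise dropout, the same dimensions are erased at
    every position, so the information they carry is removed completely
    rather than only at scattered positions.
    """
    if not training or p == 0.0:
        return x
    d_model = x.size(-1)
    # One keep/drop decision per embedding dimension, shared across the
    # batch and sequence axes (a simplifying assumption for this sketch).
    keep = (torch.rand(d_model, device=x.device) >= p).to(x.dtype)
    # Rescale surviving dimensions so the expected activation magnitude
    # is unchanged, mirroring inverted dropout.
    return x * keep / (1.0 - p)


# Usage: apply to the input of a self-attention block during training.
hidden = torch.randn(8, 50, 512)
regularized = drop_dim(hidden, p=0.1, training=True)
```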