Prior work has attempted to understand the internal structure and function of Transformer-based encoder-decoder architectures at the level of multi-head attention and feed-forward sublayers. Interpretations have focused on the encoder and the decoder, along with the combinatorial possibilities of the self-attention, cross-attention, and feed-forward sublayers. However, without examining lower-level structures, one gains only a limited understanding of the motivation behind sublayer reordering. Can we go below the sublayer abstraction and permute layer weight matrices to improve translation quality? We propose AEIUOrder, which greedily reorders layer weight matrices in the encoder by their well-trainedness, as measured by Heavy-Tailed Self-Regularization (HT-SR) metrics, and orders the decoder matrices correspondingly. Our results suggest that greedily reordering layer weight matrices to maximize Total well-trainedness helps the model learn representations and generate translations more effectively.
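To make the idea concrete, the sketch below illustrates one way a per-matrix well-trainedness score and a greedy ordering could be computed. It is a minimal illustration, not the paper's implementation: it assumes well-trainedness is proxied by an HT-SR-style power-law exponent of each matrix's eigenvalue spectral density (lower exponent roughly indicating a better-trained matrix), estimated here with a simple Hill estimator, and it does not model the corresponding decoder ordering. The function names and the choice of estimator are illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's code): score each encoder layer
# weight matrix with a power-law exponent alpha of its eigenvalue spectral
# density, then order matrices from best- to worst-trained (ascending alpha).
import numpy as np

def htsr_alpha(W: np.ndarray, k_frac: float = 0.5) -> float:
    """Estimate a power-law exponent of the ESD of W^T W via a Hill estimator."""
    # Eigenvalues of the correlation matrix are the squared singular values of W.
    evals = np.sort(np.linalg.svd(W, compute_uv=False) ** 2)[::-1]
    k = max(2, int(k_frac * len(evals)))   # number of tail eigenvalues used
    tail = evals[:k]
    # Hill estimator: alpha = 1 + k / sum_i log(lambda_i / lambda_k)
    return 1.0 + k / np.sum(np.log(tail / tail[-1] + 1e-12))

def order_by_well_trainedness(encoder_weights: list[np.ndarray]) -> list[int]:
    """Return layer indices sorted by the alpha score (most well-trained first)."""
    alphas = [htsr_alpha(W) for W in encoder_weights]
    return sorted(range(len(encoder_weights)), key=lambda i: alphas[i])

# Toy usage: six random matrices standing in for encoder sublayer weights.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((512, 512)) for _ in range(6)]
print(order_by_well_trainedness(weights))
```

Because the total well-trainedness here is a sum of independent per-matrix scores, the greedy ordering reduces to a sort; the actual metric and ordering constraints used in the paper may differ.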