Decoder-Only or Encoder-Decoder? Interpreting Language Model as a Regularized Encoder-Decoder

The sequence-to-sequence (seq2seq) task aims to generate a target sequence from a given source sequence. Traditionally, most seq2seq tasks are solved with the encoder-decoder framework, which requires an encoder to encode the source sequence and a decoder to generate the target text. Recently, a number of approaches have emerged that apply decoder-only language models directly to seq2seq tasks. Despite significant advances in applying language models to seq2seq tasks, a thorough analysis of the effectiveness of the decoder-only language model architecture is still lacking. This paper addresses this gap with a detailed comparison between the encoder-decoder architecture and the decoder-only language model framework, carried out through the analysis of a regularized encoder-decoder structure. This structure is designed to replicate all behaviors of the classical decoder-only language model, but it has an encoder and a decoder, which makes it easier to compare with the classical encoder-decoder structure. Based on the analysis, we unveil the attention degeneration problem in the language model: as the generation step number grows, less and less attention is focused on the source sequence. To give a quantitative understanding of this problem, we conduct a theoretical sensitivity analysis of the attention output with respect to the source input. Grounded in our analysis, we propose a novel partial attention language model to solve the attention degeneration problem. Experimental results on machine translation, summarization, and data-to-text generation tasks support our analysis and demonstrate the effectiveness of our proposed model.

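The attention degeneration effect described in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's sensitivity analysis; it assumes, purely for illustration, that a decoder-only model at generation step t attends over one sequence made of n source tokens followed by t already-generated target tokens, and that every key receives the same attention score. Under that assumption the share of attention on the source is n / (n + t), which shrinks as t grows. The function names and the uniform-score setup are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def source_attention_mass(scores, n_source):
    """Fraction of one query's attention that lands on the first n_source keys."""
    weights = softmax(scores)
    return sum(weights[:n_source])


def uniform_source_mass(n_source, step):
    """Toy assumption: all n_source + step keys get the same score (0.0)."""
    scores = [0.0] * (n_source + step)
    return source_attention_mass(scores, n_source)


if __name__ == "__main__":
    n_source = 20  # hypothetical source length
    for step in (1, 10, 50, 200):
        mass = uniform_source_mass(n_source, step)
        print(f"step={step:4d}  attention mass on source={mass:.3f}")
```

By contrast, cross-attention in a classical encoder-decoder attends only over the encoder outputs, so all of its attention mass stays on the source at every step. Real attention scores are not uniform, so this is only an intuition for why the source share can dilute as the generated prefix grows.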