In sequence-to-sequence learning, e.g., natural language generation, the decoder relies on the attention mechanism to efficiently extract information from the encoder. While it is common practice to draw information from only the last encoder layer, recent work has proposed using representations from different encoder layers to capture diversified levels of information. Nonetheless, the decoder still obtains only a single view of the source sequences, which might lead to insufficient training of the encoder layer stack due to the hierarchy bypassing problem. In this work, we propose layer-wise multi-view decoding, where, for each decoder layer, the representations from the last encoder layer, which serve as a global view, are supplemented with those from other encoder layers for a stereoscopic view of the source sequences. Systematic experiments and analyses show that our approach successfully addresses the hierarchy bypassing problem, requires an almost negligible increase in parameters, and substantially improves the performance of sequence-to-sequence learning with deep representations on six diverse tasks, i.e., machine translation, abstractive summarization, image captioning, video captioning, medical report generation, and paraphrase generation. In particular, our approach achieves new state-of-the-art results on ten benchmark datasets, including a low-resource machine translation dataset and two low-resource medical report generation datasets.
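To make the idea concrete, the following is a minimal PyTorch sketch of layer-wise multi-view decoding: each decoder layer cross-attends both to the last encoder layer (the global view) and to one other encoder layer (the supplementary view). The sigmoid-gated fusion of the two views, the pairing of decoder layer i with encoder layer i, and the module names (`MultiViewDecoderLayer`, `MultiViewDecoder`) are illustrative assumptions, not the paper's exact formulation; attention masks and dropout are omitted for brevity.

```python
import torch
import torch.nn as nn


class MultiViewDecoderLayer(nn.Module):
    """Decoder layer with two cross-attention views of the encoder:
    the last encoder layer (global view) and one other layer."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Separate cross-attentions for the global and layer-specific views.
        self.global_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.view_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model)
        )
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(3))
        # Learned scalar gate mixing the two views (an assumed fusion scheme).
        self.gate_logit = nn.Parameter(torch.zeros(1))

    def forward(self, x, enc_last, enc_view):
        # x: (batch, tgt_len, d_model); enc_*: (batch, src_len, d_model).
        # Causal masking of self-attention is omitted in this sketch.
        x = self.norms[0](x + self.self_attn(x, x, x)[0])
        g = self.global_attn(x, enc_last, enc_last)[0]  # global view
        v = self.view_attn(x, enc_view, enc_view)[0]    # supplementary view
        a = torch.sigmoid(self.gate_logit)
        x = self.norms[1](x + a * g + (1 - a) * v)
        return self.norms[2](x + self.ffn(x))


class MultiViewDecoder(nn.Module):
    def __init__(self, n_layers=6, d_model=512, n_heads=8):
        super().__init__()
        self.layers = nn.ModuleList(
            MultiViewDecoderLayer(d_model, n_heads) for _ in range(n_layers)
        )

    def forward(self, x, enc_states):
        # enc_states: list of per-layer encoder outputs; the last entry is the
        # conventional last-encoder-layer representation.
        for i, layer in enumerate(self.layers):
            # Pair decoder layer i with encoder layer i (an illustrative
            # layer-matching scheme).
            x = layer(x, enc_states[-1], enc_states[i])
        return x


# Usage with random tensors standing in for encoder states and decoder input.
dec = MultiViewDecoder(n_layers=6)
enc_states = [torch.randn(2, 10, 512) for _ in range(6)]
tgt = torch.randn(2, 7, 512)
out = dec(tgt, enc_states)  # (2, 7, 512)
```

Keeping the last encoder layer as one of the two views in every decoder layer preserves the standard global pathway, while the per-layer supplementary attention routes gradients directly into earlier encoder layers, which is how the approach counteracts hierarchy bypassing with only a handful of extra parameters per layer.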