Attention-based autoregressive models have achieved state-of-the-art performance in various sequence-to-sequence tasks, including Text-To-Speech (TTS) and Neural Machine Translation (NMT), but can be difficult to train. The standard training approach, teacher forcing, guides a model with the reference back-history. During inference, the generated back-history must be used instead. This mismatch limits evaluation performance. Attention forcing has been introduced to address the mismatch, guiding the model with the generated back-history and the reference attention. While successful in tasks with continuous outputs like TTS, attention forcing faces additional challenges in tasks with discrete outputs like NMT. This paper introduces two extensions of attention forcing to tackle these challenges. (1) Scheduled attention forcing automatically turns attention forcing on and off, which is essential for tasks with discrete outputs. (2) Parallel attention forcing parallelizes training, and is applicable to Transformer-based models. The experiments show that the proposed approaches improve the performance of models based on RNNs and Transformers.
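To make the contrast between the two training schemes concrete, the following is a minimal sketch, not the authors' implementation: a toy PyTorch GRU decoder with dot-product attention, where the class and function names (ToyAttnDecoder, teacher_forcing_pass, attention_forcing_pass) and all hyperparameters are illustrative assumptions. Teacher forcing feeds the reference back-history and lets the model use its own attention; attention forcing feeds the generated back-history, forces the reference attention (taken here from a teacher-forcing pass of a separate model) when building the context, and adds a KL term between the model's own attention and the reference attention.

```python
# Hypothetical sketch of teacher forcing vs. attention forcing (not from the paper).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyAttnDecoder(nn.Module):
    """One-layer GRU decoder with dot-product attention over encoder states."""
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.gru = nn.GRUCell(2 * hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, vocab_size)

    def step(self, prev_token, state, enc_states, forced_attention=None):
        # The model always computes its own attention over encoder states.
        scores = torch.einsum("bh,bth->bt", state, enc_states)
        own_attention = F.softmax(scores, dim=-1)
        # Attention forcing: the context is built with the reference attention,
        # while the model's own attention is returned for the KL loss.
        attention = own_attention if forced_attention is None else forced_attention
        context = torch.einsum("bt,bth->bh", attention, enc_states)
        state = self.gru(torch.cat([self.embed(prev_token), context], dim=-1), state)
        return self.out(state), state, own_attention

def teacher_forcing_pass(model, enc_states, reference):
    """Feed the reference back-history; collect outputs and the model's attention."""
    batch, steps = reference.shape
    state = enc_states.new_zeros(batch, enc_states.size(-1))
    prev = torch.zeros(batch, dtype=torch.long)          # <bos> assumed to be id 0
    logits_all, attn_all = [], []
    for t in range(steps):
        logits, state, attn = model.step(prev, state, enc_states)
        logits_all.append(logits)
        attn_all.append(attn)
        prev = reference[:, t]                           # reference back-history
    return torch.stack(logits_all, 1), torch.stack(attn_all, 1)

def attention_forcing_pass(model, enc_states, reference, ref_attention):
    """Feed the generated back-history, but force the reference attention."""
    batch, steps = reference.shape
    state = enc_states.new_zeros(batch, enc_states.size(-1))
    prev = torch.zeros(batch, dtype=torch.long)
    out_loss, attn_loss = 0.0, 0.0
    for t in range(steps):
        logits, state, attn = model.step(
            prev, state, enc_states, forced_attention=ref_attention[:, t])
        out_loss = out_loss + F.cross_entropy(logits, reference[:, t])
        attn_loss = attn_loss + F.kl_div(
            attn.clamp_min(1e-8).log(), ref_attention[:, t], reduction="batchmean")
        prev = logits.argmax(dim=-1)                     # generated back-history
    return out_loss / steps, attn_loss / steps

if __name__ == "__main__":
    torch.manual_seed(0)
    vocab, hidden, src_len, tgt_len, batch = 20, 16, 7, 5, 2
    enc_states = torch.randn(batch, src_len, hidden)     # stand-in encoder output
    reference = torch.randint(1, vocab, (batch, tgt_len))
    tf_model = ToyAttnDecoder(vocab, hidden)             # teacher-forcing model
    af_model = ToyAttnDecoder(vocab, hidden)             # attention-forcing model
    with torch.no_grad():
        _, ref_attention = teacher_forcing_pass(tf_model, enc_states, reference)
    out_loss, attn_loss = attention_forcing_pass(
        af_model, enc_states, reference, ref_attention)
    print(out_loss.item(), attn_loss.item())
```

In this sketch the sequential loop in attention_forcing_pass is what parallel attention forcing would remove for Transformer-based models, and scheduled attention forcing would decide per sequence whether to apply the forced attention at all.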