Deliberation networks are a family of sequence-to-sequence models that have achieved state-of-the-art performance in a wide range of tasks, such as machine translation and speech synthesis. A deliberation network consists of multiple standard sequence-to-sequence models, each conditioned on the initial input and the output of the previous model. Training raises several key questions: whether to apply Monte Carlo approximation to the gradients or to the loss; whether to train the standard models jointly or separately; whether to run an intermediate model in teacher-forcing or free-running mode; and whether to apply task-specific techniques. Previous work on deliberation networks typically explores one or two training options for a specific task. This work introduces a unifying framework that covers various training options and addresses the above questions. In general, it is simpler to approximate the gradients. When parallel training is essential, separate training should be adopted. Regardless of the task, the intermediate model should be in free-running mode. For tasks where the output is continuous, a guided attention loss can be used to prevent degradation into a standard model.
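The chained structure described above, in which each model sees the initial input together with the previous model's output, can be illustrated with a minimal sketch. This is only a toy under stated assumptions: the names `first_pass`, `refine_pass`, and `deliberate` are hypothetical, and simple integer arithmetic stands in for trained sequence-to-sequence models. It shows the conditioning pattern only, not any of the training options.

```python
from typing import Callable, List, Optional, Sequence

Tokens = List[int]
# Hypothetical interface: a pass maps (initial input, previous output) to a new output.
Pass = Callable[[Sequence[int], Optional[Sequence[int]]], Tokens]


def first_pass(x: Sequence[int], prev: Optional[Sequence[int]] = None) -> Tokens:
    """Toy stand-in for a standard seq2seq model; it uses only the input."""
    return [t + 1 for t in x]


def refine_pass(x: Sequence[int], prev: Optional[Sequence[int]]) -> Tokens:
    """Toy later pass; it is conditioned on the input and the previous output."""
    if prev is None:
        raise ValueError("a refinement pass needs the previous pass's output")
    return [a + b for a, b in zip(x, prev)]


def deliberate(x: Sequence[int], passes: Sequence[Pass]) -> List[Tokens]:
    """Run the passes in order and return every pass's output."""
    outputs: List[Tokens] = []
    prev: Optional[Tokens] = None
    for model in passes:
        # Each pass receives the same initial input plus the previous output.
        prev = model(x, prev)
        outputs.append(prev)
    return outputs


if __name__ == "__main__":
    print(deliberate([1, 2, 3], [first_pass, refine_pass]))
```

The last element of the returned list plays the role of the final output; the earlier elements correspond to intermediate outputs.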