A popular strategy for training recurrent neural networks (RNNs), known as ``teacher forcing'', takes the ground truth as input at each time step and makes the later predictions partly conditioned on those inputs. Such a training strategy impairs the networks' ability to learn rich distributions over entire sequences, because the chosen inputs hinder the gradients from back-propagating to all previous states in an end-to-end manner. We propose a fully differentiable training algorithm for RNNs that better captures long-term dependencies by recovering the probability of the whole sequence. The key idea is that at each time step, the network takes as input a ``bundle'' of similar words predicted at the previous step instead of a single ground-truth word. The representations of these similar words form a convex hull, which can be taken as a kind of regularization of the input. Smoothing the inputs in this way keeps the whole process trainable and differentiable. This design makes it possible for the model to explore more feasible combinations (possibly unseen sequences), and can be interpreted as a computationally efficient approximation to beam search. Experiments on multiple sequence generation tasks yield performance improvements, especially on sequence-level metrics such as BLEU and ROUGE-2.
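The following is a minimal sketch of this idea, assuming a PyTorch GRU decoder; the names (\texttt{soft\_step}, \texttt{bundle\_size}, \texttt{soft\_input}) are illustrative and not taken from the paper. At each step, the embeddings of the top-$k$ words from the previous softmax are averaged with renormalized probabilities, so the input lies in the convex hull of those word vectors and gradients flow through the entire unrolled sequence.

\begin{verbatim}
# Minimal sketch, assuming a PyTorch GRU decoder; names such as
# bundle_size and soft_step are illustrative, not from the paper.
import torch
import torch.nn.functional as F

vocab_size, embed_dim, hidden_dim, bundle_size = 10000, 256, 512, 5

embedding = torch.nn.Embedding(vocab_size, embed_dim)
rnn_cell = torch.nn.GRUCell(embed_dim, hidden_dim)
output_proj = torch.nn.Linear(hidden_dim, vocab_size)

def soft_step(prev_logits, hidden):
    """One decoding step that feeds a convex combination ("bundle")
    of the top-k predicted word embeddings instead of a single token."""
    probs = F.softmax(prev_logits, dim=-1)             # (batch, vocab)
    top_p, top_idx = probs.topk(bundle_size, dim=-1)   # k most likely words
    weights = top_p / top_p.sum(dim=-1, keepdim=True)  # convex weights
    # The weighted average of embeddings lies in the convex hull
    # of the k word vectors; the step stays fully differentiable.
    soft_input = (weights.unsqueeze(-1) * embedding(top_idx)).sum(dim=1)
    hidden = rnn_cell(soft_input, hidden)
    return output_proj(hidden), hidden

# Usage: unroll a few steps from an initial state.
batch = 4
hidden = torch.zeros(batch, hidden_dim)
logits = torch.zeros(batch, vocab_size)  # uniform start distribution
for _ in range(3):
    logits, hidden = soft_step(logits, hidden)
\end{verbatim}

Because the bundle input is a smooth function of the previous step's distribution, training can back-propagate through every time step, which is what distinguishes this scheme from feeding a single sampled or ground-truth token.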