The Transformer has recently shown promising results in many sequence-to-sequence transformation tasks. It replaces the recurrent neural networks (RNNs) in the attention-based encoder-decoder (AED) architecture with a stack of feed-forward self-attention layers. A self-attention layer learns temporal dependencies by incorporating sinusoidal positional embeddings of the tokens in a sequence, which allows the whole sequence to be processed in parallel and thus yields faster training iterations than the sequential operation of RNNs. The deeper layer stacks of the Transformer also let it outperform RNN-based AED models. However, this parallelization is lost when scheduled sampling is applied during training. In addition, self-attention with sinusoidal positional embeddings may suffer performance degradation on long sequences that contain similar acoustic or semantic information at different positions. To address these problems, we propose parallel scheduled sampling (PSS) and relative positional embedding (RPE) to help the Transformer generalize to unseen data. Our proposed methods achieve a 7% relative improvement for short utterances and a 70% relative gain for long utterances on a 10,000-hour Mandarin ASR task.
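As background for the sinusoidal positional embedding the abstract refers to, the sketch below computes the standard formulation from the original Transformer; the sequence length and model dimension are illustrative choices, not values taken from this paper.

```python
import numpy as np

def sinusoidal_positional_embedding(max_len: int, d_model: int) -> np.ndarray:
    """Standard sinusoidal positional embedding (Vaswani et al., 2017).

    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    """
    positions = np.arange(max_len)[:, None]                                     # (max_len, 1)
    div_terms = np.exp(-np.log(10000.0) * np.arange(0, d_model, 2) / d_model)   # (d_model/2,)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(positions * div_terms)
    pe[:, 1::2] = np.cos(positions * div_terms)
    return pe

# The embedding depends only on the absolute position, so every position can be
# computed independently and added to the token embeddings in one parallel step.
pe = sinusoidal_positional_embedding(max_len=128, d_model=256)
print(pe.shape)  # (128, 256)
```

Because each row encodes an absolute position, two segments with similar content at different positions receive different embeddings, which is the sensitivity on long sequences that the proposed relative positional embedding is meant to address.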