Transformer-based models have been achieving state-of-the-art results in several fields of Natural Language Processing. However, their direct application to speech tasks is not trivial. The nature of speech sequences introduces problems such as long sequence lengths and redundancy between adjacent tokens, so we believe that the regular self-attention mechanism might not be well suited for them. Different approaches have been proposed to overcome these problems, such as the use of efficient attention mechanisms. However, these methods usually come at a cost: a performance reduction caused by information loss. In this study, we present the Multiformer, a Transformer-based model that allows a different attention mechanism to be used on each head. By doing this, the model biases the self-attention towards the extraction of more diverse token interactions, and the information loss is reduced. Finally, we analyze the head contributions and observe that architectures in which the relevance of all heads is uniformly distributed obtain better results. Our results show that mixing attention patterns across the different heads and layers outperforms our baseline by up to 0.7 BLEU.
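To make the per-head idea concrete, the following is a minimal sketch (not the authors' implementation) of a self-attention layer in which each head may apply a different attention mechanism. Two illustrative patterns are mixed here, full softmax attention and local (windowed) attention; the head-to-mechanism assignment, window size, and dimensions are hypothetical choices for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def full_attention(q, k, v):
    # Standard scaled dot-product attention over the whole sequence.
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    return F.softmax(scores, dim=-1) @ v


def local_attention(q, k, v, window=16):
    # Restrict each query to keys within a fixed window around its position.
    T = q.size(-2)
    idx = torch.arange(T, device=q.device)
    mask = (idx[None, :] - idx[:, None]).abs() > window  # True = masked out
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v


class MixedHeadSelfAttention(nn.Module):
    """Self-attention whose heads may use different attention mechanisms."""

    def __init__(self, d_model=256, n_heads=4, n_local_heads=2, window=16):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.n_local_heads, self.window = n_local_heads, window
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape each projection to (batch, heads, time, head_dim).
        split = lambda t: t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = map(split, (q, k, v))
        outs = []
        for h in range(self.n_heads):
            # First n_local_heads heads use local attention, the rest use full attention.
            if h < self.n_local_heads:
                outs.append(local_attention(q[:, h], k[:, h], v[:, h], self.window))
            else:
                outs.append(full_attention(q[:, h], k[:, h], v[:, h]))
        y = torch.stack(outs, dim=1).transpose(1, 2).reshape(B, T, -1)
        return self.out(y)


if __name__ == "__main__":
    x = torch.randn(2, 100, 256)   # (batch, sequence length, model dim)
    layer = MixedHeadSelfAttention()
    print(layer(x).shape)          # torch.Size([2, 100, 256])
```

In this sketch the mechanism mix is fixed per layer; in practice the assignment of attention patterns to heads and layers is a design choice that can be varied, which is the kind of combination the abstract refers to.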