Transformers have achieved state-of-the-art results across multiple NLP tasks. However, the complexity of the self-attention mechanism scales quadratically with the sequence length, which is an obstacle for tasks involving long sequences, such as those in the speech domain. In this paper, we discuss the usefulness of self-attention for Direct Speech Translation. First, we analyze the layer-wise token contributions in the self-attention of the encoder, unveiling local diagonal patterns. To prove that some attention weights are avoidable, we propose to replace standard self-attention with a local, efficient variant, setting the amount of context according to the results of the analysis. With this approach, our model matches the baseline performance and improves efficiency by skipping the computation of the weights that standard attention discards.
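To make the idea of local self-attention concrete, the following is a minimal NumPy sketch of windowed (banded) attention, where each position attends only to neighbours within a fixed context window. The function name, the `window` parameter, and the use of masking are illustrative assumptions, not the paper's implementation: an actually efficient kernel would compute only the scores inside the band rather than masking a full matrix.

```python
import numpy as np

def local_self_attention(X, W_q, W_k, W_v, window=8):
    """Illustrative windowed self-attention: each position attends only to
    tokens within +/- `window` steps, i.e. a banded score matrix.

    Note: for clarity this sketch builds the full T x T score matrix and
    masks it; an efficient implementation would skip the out-of-band
    computations entirely."""
    T, d = X.shape
    Q, K, V = X @ W_q, X @ W_k, X @ W_v

    # Full scaled dot-product scores (only for illustration).
    scores = Q @ K.T / np.sqrt(d)

    # Keep only the local band around the diagonal.
    idx = np.arange(T)
    band = np.abs(idx[:, None] - idx[None, :]) <= window
    scores = np.where(band, scores, -np.inf)

    # Row-wise softmax over the allowed positions.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Example usage with random projections (hypothetical shapes).
rng = np.random.default_rng(0)
T, d = 64, 16
X = rng.standard_normal((T, d))
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))
out = local_self_attention(X, W_q, W_k, W_v, window=8)
print(out.shape)  # (64, 16)
```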