The advent of Transformer-based models has pushed past the boundaries of text. When working with speech, however, we face a problem: the sequence length of an audio input is too long for the Transformer. The usual workaround is to add strided convolutional layers that reduce the sequence length before it reaches the Transformer. In this paper, we propose a new approach to direct Speech Translation in which, thanks to an efficient Transformer, we can work on the spectrogram without placing convolutional layers in front of the Transformer. This allows the encoder to learn directly from the spectrogram, so no information is lost. We build an encoder-decoder model whose encoder is an efficient Transformer -- the Longformer -- and whose decoder is a traditional Transformer decoder. Our results, which are close to those obtained with the standard approach, show that this is a promising research direction.
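The proposed architecture can be sketched in PyTorch. This is a minimal illustration, not the authors' implementation: it approximates the Longformer's sliding-window self-attention with a banded attention mask on a standard Transformer encoder (global attention tokens are omitted), and all dimensions, layer counts, and names (`SpectrogramTranslator`, `n_mels`, `window`, etc.) are illustrative assumptions. The key point it demonstrates is that spectrogram frames are projected frame-wise into the model dimension, with no strided convolutions shortening the sequence.

```python
import torch
import torch.nn as nn


def local_attention_mask(seq_len, window):
    # Boolean mask where True means "may NOT attend": each frame attends
    # only to neighbours within +/- window positions, mimicking the
    # Longformer's sliding-window attention (global tokens omitted).
    idx = torch.arange(seq_len)
    dist = (idx[None, :] - idx[:, None]).abs()
    return dist > window


class SpectrogramTranslator(nn.Module):
    # Illustrative sketch of the paper's setup: an efficient (locally
    # attending) Transformer encoder over raw spectrogram frames, and a
    # standard autoregressive Transformer decoder over target tokens.
    def __init__(self, n_mels=80, d_model=256, nhead=4, vocab=1000, window=16):
        super().__init__()
        self.window = window
        # Frame-wise linear projection: sequence length is preserved,
        # unlike the usual strided-convolution front end.
        self.proj = nn.Linear(n_mels, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.embed = nn.Embedding(vocab, d_model)
        self.out = nn.Linear(d_model, vocab)

    def forward(self, spec, tgt_tokens):
        # spec: (batch, frames, n_mels); tgt_tokens: (batch, tgt_len)
        x = self.proj(spec)
        mask = local_attention_mask(spec.size(1), self.window).to(spec.device)
        memory = self.encoder(x, mask=mask)
        tgt = self.embed(tgt_tokens)
        causal = nn.Transformer.generate_square_subsequent_mask(tgt_tokens.size(1))
        h = self.decoder(tgt, memory, tgt_mask=causal)
        return self.out(h)  # (batch, tgt_len, vocab) logits


model = SpectrogramTranslator()
spec = torch.randn(2, 100, 80)          # 2 utterances, 100 frames, 80 mel bins
tokens = torch.randint(0, 1000, (2, 7))  # 7 target tokens per utterance
logits = model(spec, tokens)
```

Note that the encoder output (`memory`) keeps all 100 frame positions, since nothing downsamples the sequence; the decoder cross-attends over the full-resolution representation.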