End-to-end (E2E) speech-to-text translation (ST) often relies on pretraining its encoder and/or decoder with source transcripts via speech recognition or text translation tasks; without such pretraining, translation performance drops substantially. However, transcripts are not always available, and how significant such pretraining is for E2E ST has rarely been studied in the literature. In this paper, we revisit this question and explore the extent to which the quality of E2E ST trained on speech-translation pairs alone can be improved. We reexamine several techniques previously shown to benefit ST and offer a set of best practices that bias a Transformer-based E2E ST system toward training from scratch. In addition, we propose a parameterized distance penalty to facilitate the modeling of locality in the self-attention model for speech. On four benchmarks covering 23 languages, our experiments show that, without using any transcripts or pretraining, the proposed system matches and even outperforms previous studies that adopt pretraining, although a gap remains in (extremely) low-resource settings. Finally, we discuss neural acoustic feature modeling, where a neural model is designed to extract acoustic features directly from raw speech signals, with the goal of simplifying inductive biases and giving the model more freedom in describing speech. For the first time, we demonstrate its feasibility and show encouraging results on ST tasks.
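To make the locality idea concrete, the following is a minimal sketch of self-attention with a distance penalty on the logits. It assumes the penalty takes the form of a learnable scalar (per head in a real model) scaling the absolute position distance |i - j|; the exact parameterization used in the paper may differ, and the function and variable names here are illustrative only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_distance_penalty(q, k, v, gamma):
    """Scaled dot-product self-attention with an additive distance penalty.

    q, k, v: (seq_len, d) arrays for a single head.
    gamma:   penalty strength; learned per head in a real model,
             passed as a fixed scalar here for illustration.
    """
    seq_len, d = q.shape
    logits = q @ k.T / np.sqrt(d)                # (seq_len, seq_len)
    pos = np.arange(seq_len)
    dist = np.abs(pos[:, None] - pos[None, :])   # |i - j| distance matrix
    logits = logits - gamma * dist               # penalize distant positions
    weights = softmax(logits, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
q = rng.standard_normal((6, 4))
k = rng.standard_normal((6, 4))
v = rng.standard_normal((6, 4))
out, w = attention_with_distance_penalty(q, k, v, gamma=5.0)
```

With a larger gamma, each row of the attention matrix concentrates more mass near the diagonal, i.e. on nearby frames, which is the locality bias that helps with long, redundant speech inputs.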