This paper describes FBK's system submission to the IWSLT 2021 Offline Speech Translation task. We participated with a direct model, a Transformer-based architecture trained to translate English speech audio into German text. The training pipeline is characterized by knowledge distillation and a two-step fine-tuning procedure. Both knowledge distillation and the first fine-tuning step are carried out on manually segmented real and synthetic data, the latter generated with an MT system trained on the available corpora. In contrast, the second fine-tuning step is carried out on a random segmentation of the MuST-C v2 En-De dataset. Its main goal is to reduce the performance drop that occurs when a speech translation model trained on manually segmented data (i.e. an ideal, sentence-like segmentation) is evaluated on automatically segmented audio (i.e. actual, more realistic testing conditions). For the same purpose, a custom hybrid segmentation procedure that accounts for both audio content (pauses) and the length of the produced segments is applied to the test data before passing them to the system. At inference time, we compared this procedure with a baseline segmentation method based on Voice Activity Detection (VAD). Our results show the effectiveness of the proposed hybrid approach, which reduces the gap with manual segmentation from 8.3 to 1.4 BLEU points.
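To make the hybrid segmentation idea concrete, the following is a minimal illustrative sketch, not the authors' actual implementation: it recursively splits an audio file at the longest detected pause until every segment is below a maximum duration, falling back to a hard cut when no pause is available. All names here (`hybrid_segment`, `Pause`, the `max_len=20.0` threshold) are hypothetical assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass
class Pause:
    start: float  # pause start time (seconds)
    end: float    # pause end time (seconds)

def hybrid_segment(audio_len, pauses, max_len=20.0):
    """Hypothetical sketch of a pause- and length-aware segmenter:
    recursively split [0, audio_len] at the longest pause until every
    segment is at most max_len seconds long."""
    def split(seg_start, seg_end):
        if seg_end - seg_start <= max_len:
            return [(seg_start, seg_end)]
        # candidate pauses strictly inside the current segment
        inner = [p for p in pauses if seg_start < p.start and p.end < seg_end]
        if not inner:
            # no pause to exploit: fall back to a hard cut at max_len
            cut = seg_start + max_len
            return [(seg_start, cut)] + split(cut, seg_end)
        # cut at the longest pause, keeping the pause out of both halves
        best = max(inner, key=lambda p: p.end - p.start)
        return split(seg_start, best.start) + split(best.end, seg_end)
    return split(0.0, audio_len)

# Example: a 45 s recording with pauses at 12-13 s and 30-30.5 s
segments = hybrid_segment(45.0, [Pause(12.0, 13.0), Pause(30.0, 30.5)])
print(segments)  # [(0.0, 12.0), (13.0, 30.0), (30.5, 45.0)]
```

In contrast to a purely VAD-based segmenter, which cuts at every detected silence regardless of the resulting segment lengths, a scheme like this one keeps segments close to the sentence-like lengths seen in training, which is what the paper credits for closing most of the gap with manual segmentation.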