To alleviate the data scarcity problem in end-to-end speech translation (ST), pre-training on speech recognition and machine translation data is considered an important technique. However, the modality gap between speech and text prevents the ST model from efficiently inheriting knowledge from the pre-trained models. In this work, we propose AdaTranS for end-to-end ST. It adapts the speech features with a new shrinking mechanism that mitigates the length mismatch between speech and text features by predicting word boundaries. Experiments on the MUST-C dataset demonstrate that AdaTranS achieves better performance than other shrinking-based methods, with higher inference speed and lower memory usage. Further experiments also show that AdaTranS can be equipped with additional alignment losses to further improve performance.