Training speech translation (ST) models requires large, high-quality datasets. MuST-C is one of the most widely used ST benchmark datasets. It contains around 400 hours of speech-transcript-translation data for each of its eight translation directions. The dataset passed several quality-control filters during its creation. However, we find that MuST-C still suffers from three major quality issues: audio-text misalignment, inaccurate translation, and unnecessary speaker names. What impact do these data quality issues have on model development and evaluation? In this paper, we propose an automatic method to fix or filter out the above quality issues, using English-German (En-De) translation as an example. Our experiments show that ST models perform better on clean test sets, and that the ranking of the proposed models remains consistent across different test sets. In addition, simply removing misaligned data points from the training set does not lead to a better ST model.