In this paper, we introduce a high-quality, large-scale benchmark dataset for English-Vietnamese speech translation. It contains 508 hours of audio, comprising 331K triplets of (sentence-length audio, English source transcript sentence, Vietnamese target subtitle sentence). We also conduct empirical experiments with strong baselines and find that the traditional "Cascaded" approach still outperforms the modern "End-to-End" approach. To the best of our knowledge, this is the first large-scale English-Vietnamese speech translation study. We hope that both our publicly available dataset and our study can serve as a starting point for future research and applications in English-Vietnamese speech translation. Our dataset is available at https://github.com/VinAIResearch/PhoST