Lectures translation is a case of spoken language translation, and there is a lack of publicly available parallel corpora for this purpose. To address this, we examine a language-independent framework for parallel corpus mining, a quick and effective way to mine a parallel corpus from publicly available lectures on Coursera. Our approach determines sentence alignments by relying on machine translation and cosine similarity over continuous-space sentence representations. We also show how to use the resulting corpora for high-quality lectures translation through domain adaptation based on multistage fine-tuning. For Japanese--English lectures translation, we extracted parallel data of approximately 40,000 lines and created development and test sets through manual filtering for benchmarking translation performance. We demonstrate that the mined corpus greatly enhances the quality of translation when used in conjunction with out-of-domain parallel corpora via multistage training. This paper also suggests guidelines for gathering and cleaning corpora, mining parallel sentences, addressing noise in the mined data, and creating high-quality evaluation splits. For the sake of reproducibility, we will release our code for parallel data creation.
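The alignment step described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes sentence embeddings have already been computed (e.g., after machine-translating the source side into the target language), and the function names and the similarity threshold are hypothetical choices for illustration.

```python
# Hypothetical sketch of cosine-similarity-based sentence alignment.
# Assumes src_vecs / tgt_vecs are precomputed continuous-space sentence
# embeddings; in the described pipeline, the source side would first be
# machine-translated into the target language before embedding.
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def mine_pairs(src_vecs, tgt_vecs, threshold=0.8):
    """Greedily match each source sentence to its most similar unused
    target sentence, keeping only pairs above a similarity threshold."""
    pairs, used = [], set()
    for i, sv in enumerate(src_vecs):
        scores = [(cosine(sv, tv), j)
                  for j, tv in enumerate(tgt_vecs) if j not in used]
        if not scores:
            continue
        best, j = max(scores)
        if best >= threshold:
            pairs.append((i, j, best))
            used.add(j)
    return pairs
```

A greedy one-to-one matching is only one possible decoding of the similarity matrix; thresholding and filtering choices would need tuning against the noise level of the crawled data.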