Existing neural machine translation (NMT) systems have achieved near human-level performance in the general domain for some languages, but the lack of parallel corpora remains a key problem in specific domains. In the biomedical domain, parallel corpora are even less accessible. This work presents a new unsupervised sentence alignment method and explores features in training biomedical NMT systems. We use a simple but effective way to build bilingual word embeddings (BWEs) to evaluate bilingual word similarity, and we transform the sentence alignment problem into an extended earth mover's distance (EMD) problem. The proposed method achieves high accuracy in both 1-to-1 and many-to-many cases. Pre-training in the general domain, a larger in-domain dataset, and n-to-m sentence pairs all benefit the NMT model. Fine-tuning on the in-domain corpus helps the translation model learn more terminology and fit the in-domain style of text.
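To make the EMD formulation concrete, the following is a minimal sketch of scoring sentence pairs with a plain (not extended) earth mover's distance over a shared bilingual embedding space. It is not the method of this work: the toy 3-dimensional embeddings in `EMB`, the uniform word weights, the cosine ground distance, and the helper names `cosine_distance`, `sentence_emd` and `align` are all assumptions chosen for illustration, and only the 1-to-1 case is handled.

```python
import math

# Hypothetical toy bilingual embeddings (English/German) in one shared space.
EMB = {
    "heart": (1.0, 0.0, 0.0), "herz": (0.9, 0.1, 0.0),
    "drug": (0.0, 1.0, 0.0), "medikament": (0.1, 0.9, 0.0),
    "dose": (0.0, 0.0, 1.0), "dosis": (0.0, 0.1, 0.9),
}

def cosine_distance(u, v):
    """Ground distance between two word vectors: 1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def sentence_emd(src, tgt, emb):
    """EMD between two non-empty token lists with uniform word weights.

    Masses 1/n and 1/m are scaled to the integers m and n, and the resulting
    transportation problem is solved as a min-cost flow using successive
    shortest paths (Bellman-Ford on the residual graph).
    """
    n, m = len(src), len(tgt)
    if n == 0 or m == 0:
        raise ValueError("sentences must be non-empty")
    source, sink, n_nodes = 0, n + m + 1, n + m + 2
    graph = [[] for _ in range(n_nodes)]  # edge: [to, capacity, cost, rev_index]

    def add_edge(u, v, cap, cost):
        graph[u].append([v, cap, cost, len(graph[v])])
        graph[v].append([u, 0, -cost, len(graph[u]) - 1])

    for i, w in enumerate(src):
        add_edge(source, 1 + i, m, 0.0)          # each source word supplies m units
        for j, x in enumerate(tgt):
            add_edge(1 + i, 1 + n + j, n * m, cosine_distance(emb[w], emb[x]))
    for j in range(m):
        add_edge(1 + n + j, sink, n, 0.0)        # each target word demands n units

    total_cost = 0.0
    while True:
        dist = [math.inf] * n_nodes
        prev = [None] * n_nodes
        dist[source] = 0.0
        for _ in range(n_nodes - 1):             # Bellman-Ford relaxation rounds
            updated = False
            for u in range(n_nodes):
                if dist[u] == math.inf:
                    continue
                for idx, (v, cap, cost, _rev) in enumerate(graph[u]):
                    if cap > 0 and dist[u] + cost < dist[v] - 1e-12:
                        dist[v] = dist[u] + cost
                        prev[v] = (u, idx)
                        updated = True
            if not updated:
                break
        if dist[sink] == math.inf:               # all mass has been moved
            break
        flow, v = math.inf, sink                 # bottleneck along the path
        while v != source:
            u, idx = prev[v]
            flow = min(flow, graph[u][idx][1])
            v = u
        v = sink
        while v != source:                       # push flow, update residuals
            u, idx = prev[v]
            graph[u][idx][1] -= flow
            graph[v][graph[u][idx][3]][1] += flow
            v = u
        total_cost += flow * dist[sink]
    return total_cost / (n * m)                  # undo the integer scaling

def align(src_sentences, tgt_sentences, emb):
    """For each source sentence, return the index of the lowest-EMD target."""
    return [
        min(range(len(tgt_sentences)),
            key=lambda j: sentence_emd(s, tgt_sentences[j], emb))
        for s in src_sentences
    ]

src_sentences = [["heart", "drug"], ["drug", "dose"]]
tgt_sentences = [["medikament", "dosis"], ["herz", "medikament"]]
print(align(src_sentences, tgt_sentences, EMB))
```

In this sketch each source sentence is greedily matched to its nearest target sentence. A many-to-many aligner, as targeted by the extended EMD formulation, would additionally have to consider merged sentence groups, which this example does not attempt.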