Sentence Alignment with Parallel Documents Helps Biomedical Machine Translation

Existing neural machine translation (NMT) systems have achieved near human-level performance in the general domain for some languages, but the lack of parallel corpora remains a key problem in specific domains. In the biomedical domain, parallel corpora are less accessible. This work presents a new unsupervised sentence alignment method and explores features in training biomedical NMT systems. We use a simple but effective way to build bilingual word embeddings (BWEs) to evaluate bilingual word similarity, and we transform the sentence alignment problem into an extended earth mover's distance (EMD) problem. The proposed method achieves high accuracy in both 1-to-1 and many-to-many cases. Pre-training in the general domain, a larger in-domain dataset, and n-to-m sentence pairs all benefit the NMT model. Fine-tuning on the in-domain corpus helps the translation model learn more terminology and fit the in-domain style of text.
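To make the idea of scoring sentence pairs with BWEs and an EMD-style distance more concrete, here is a minimal sketch. It is not the extended EMD formulation of this work: it swaps in a relaxed lower bound on EMD (each word sends all of its mass to its nearest counterpart), uses cosine distance as the word-level cost, and handles only the 1-to-1 case with a greedy choice. The toy embedding table `TOY_BWE`, the function names, and the example sentences are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical toy bilingual word embeddings in a shared 2-d space.
TOY_BWE = {
    "heart": (1.0, 0.0), "herz": (0.9, 0.1),
    "lung": (0.0, 1.0), "lunge": (0.1, 0.9),
}

def cosine_distance(u, v):
    """Return 1 - cosine similarity; zero vectors get the maximum cost 1."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

def relaxed_emd(src_words, tgt_words, emb=TOY_BWE):
    """Relaxed EMD between two sentences with uniform word weights.

    Each word moves its whole mass to the closest word on the other side.
    The larger of the two one-sided costs is a lower bound on the exact EMD.
    """
    def one_sided(xs, ys):
        return sum(
            min(cosine_distance(emb[x], emb[y]) for y in ys) for x in xs
        ) / len(xs)
    return max(one_sided(src_words, tgt_words), one_sided(tgt_words, src_words))

def align_one_to_one(src_sents, tgt_sents):
    """Greedy 1-to-1 alignment: pick the closest target for each source."""
    return [
        min(range(len(tgt_sents)), key=lambda j: relaxed_emd(s, tgt_sents[j]))
        for s in src_sents
    ]

src = [["heart"], ["lung"]]
tgt = [["lunge"], ["herz"]]
alignment = align_one_to_one(src, tgt)
print(alignment)
```

On this toy input, each source sentence is paired with the target sentence whose words lie closest in the shared space. A many-to-many aligner would additionally need to consider merged spans of adjacent sentences, which this sketch does not attempt.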

