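Project title: Research on Unsupervised Statistical Machine Translation Models Based on Monolingual Corpora
Project number: No.61303181
Project type: Young Scientists Fund project
Year of approval: 2014
Discipline: Automation technology, computer technology
Principal investigator: Zhang Jiajun (张家俊)
Affiliation: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
Funding: 230,000 yuan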
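Chinese abstract: At present, almost all statistical machine translation models are built on bilingual parallel corpora. Given sufficient bilingual parallel data for a domain, existing statistical machine translation models achieve fairly satisfactory translation results. In practice, however, bilingual parallel corpora are hard to collect, so translation quality drops sharply for a language pair or domain that lacks them. By contrast, monolingual corpora for most languages and domains are abundant on the web and easy to obtain. This project therefore aims to make full use of large-scale monolingual corpora from the web to study and build a phrase-based statistical machine translation model oriented to monolingual data. After automatically acquiring large-scale monolingual corpora in the same domain for the source and target languages, the project focuses on unsupervised methods for constructing a probabilistic bilingual lexicon from monolingual corpora, methods for learning bilingual phrase translation rules, and methods for estimating the probabilities of the translation model and the reordering model. By creatively redesigning how the translation model is constructed, the project seeks to overcome the constraint that bilingual parallel corpora impose on statistical machine translation, so that statistical translation can develop more broadly and deeply.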
Chinese keywords: machine translation;parallel corpus;monolingual data
English abstract: At present, almost all statistical machine translation models are trained on bilingual corpora. Given enough bitext for a domain, existing statistical machine translation models can achieve relatively satisfactory translation results. However, parallel corpora are very difficult to collect, and thus the quality of statistical machine translation decreases dramatically when facing a language pair or domain without any bilingual resources. In contrast, large-scale monolingual corpora exist on the web for each domain of most languages, and monolingual data is easy to obtain. Therefore, this project aims to take full advantage of the large-scale monolingual data on the web and to propose a phrase-based statistical translation method that uses only monolingual corpora. After obtaining monolingual data in the same domain for the source and target languages, this project focuses mainly on three problems, each using only monolingual corpora: an unsupervised method for constructing a probabilistic bilingual lexicon, a method for learning phrase translation rules, and a probability estimation method for the translation model and the reordering model. By designing a novel construction process for the translation model, this project tries to break through the bottleneck that statistical machine translation must depend on the bilin
English keywords: machine translation;parallel corpus;monolingual data