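A Reordering Optimization Method for Limited-Corpus Mongolian-Chinese Bidirectional Translation Based on Morphology and Multi-Word Expressions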

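Project title: A Reordering Optimization Method for Limited-Corpus Mongolian-Chinese Bidirectional Translation Based on Morphology and Multi-Word Expressions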

Project number: No. 61502445

Project type: Young Scientists Fund project

Year of approval: 2016

Discipline: Computer Science

Principal investigator: Chen Lei (陈雷)

Affiliation: Hefei Institutes of Physical Science, Chinese Academy of Sciences

Funding: 200,000 yuan (CNY)

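Chinese abstract (translated): Mongolian and Chinese differ markedly in both morphology and word order, and disordered word order in the output is one of the main errors of Mongolian-Chinese translation systems. Reordering methods that rely on statistical training over large-scale corpora achieve only limited results while Mongolian-Chinese language resources remain scarce. To address these problems, this project combines linguistic knowledge with statistical methods to mine the bilingual knowledge contained in limited Mongolian-Chinese corpora at different granularities of linguistic unit, and uses it to optimize reordering in Mongolian-Chinese translation systems. The work focuses on: 1) starting from a small manually segmented corpus, integrating supervised and unsupervised methods through augmented feature templates to achieve semi-supervised segmentation and obtain fine-grained morphological information for Mongolian; 2) investigating a multi-word expression extraction method based on morphosyntactic structural patterns and multiple filtering, so as to mine coarse-grained bilingual information from limited Mongolian-Chinese corpora; 3) using morphological information and multi-word expressions, respectively, to optimize reordering in Mongolian-Chinese translation systems, guiding the reordering direction, strengthening long-distance reordering, and ultimately improving translation quality. Through this research, the project explores techniques that combine linguistic knowledge and statistical methods to mine bilingual knowledge efficiently under limited-corpus conditions and thereby improve system reordering, providing a technical reference for research on machine translation between China's resource-limited minority languages and Chinese.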

Chinese keywords (translated): morphology; multi-word expressions; limited corpus; reordering; Mongolian

English abstract: Because Mongolian and Chinese differ significantly in morphology and word order, incorrect word order in the output is one of the major errors of Mongolian-Chinese translation systems. Statistical methods that depend on large-scale training corpora cannot achieve ideal reordering results for Mongolian-Chinese translation, because the existing Chinese-Mongolian parallel corpus is extremely limited. To solve these problems, this project combines linguistic knowledge and statistical methods to mine bilingual knowledge at different granularities of language unit from the limited Mongolian-Chinese parallel corpus and to optimize the reordering model in Mongolian-Chinese translation systems. The project focuses on: 1) based on a small manually segmented corpus, studying the integration of supervised and unsupervised methods via feature set augmentation, to achieve semi-supervised morphological segmentation for acquiring Mongolian morphological information; 2) studying multi-word expression extraction based on morphosyntactic patterns and multiple filters, to obtain bilingual multi-word expressions from the limited Mongolian-Chinese parallel corpus; 3) studying the use of morphological information and bilingual multi-word expressions to optimize the reordering model in Mongolian-Chinese translation systems, to guide the reordering direction, enhance long-distance reordering, and ultimately improve translation quality. The study will explore how to combine linguistic knowledge and statistical methods to mine bilingual knowledge efficiently from a limited Mongolian-Chinese corpus and so strengthen the reordering model, and it will provide a technical reference for studies on translation between other under-resourced languages and Chinese.

English keywords: Morphology; Multi-word Expressions; Limited Corpus; Reordering; Mongolian
