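Project title: Research on Document-Level Statistical Machine Translation Based on Discourse Semantics
Project number: No.61305088
Project type: Young Scientists Fund project
Year of approval: 2014
Discipline: Automation technology, computer technology
Project author: 贡正仙 (Gong Zhengxian)
Author affiliation: 苏州大学 (Soochow University)
Funding amount: 250,000 CNY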
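Chinese abstract: Machine translation is an active research topic in natural language understanding; it effectively promotes information sharing and has broad research value and application prospects. Statistical machine translation (SMT) is currently the mainstream machine translation technology, yet even when the input is a document, most SMT systems still translate each sentence in isolation. Building on earlier work, this project focuses on two core problems that document-level SMT urgently needs to solve: first, analyzing which discourse-level semantic information is lost when existing SMT systems translate documents; second, studying how to effectively control this loss within SMT systems. The main research contents are: 1) fully exploiting document information that can improve SMT quality and designing an effective discourse-semantic control framework; 2) exploring translation mechanisms for document-level SMT, using caching to build a baseline system capable of document-level translation, with a focus on topic-sentence-driven full-document translation that avoids excessive translation consistency; 3) exploring evaluation criteria that objectively reflect document-level SMT performance. Current sentence-level evaluation criteria cannot objectively reflect the quality changes brought by document-level translation, and research on suitable criteria can actively advance this line of work.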
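Chinese keywords: document-level machine translation; document-level machine translation evaluation; topic-sentence-driven; selectional preference;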
English abstract: Machine translation is an active research topic in natural language understanding. It can effectively promote information sharing and thus has broad research and application value. Statistical Machine Translation (SMT) is the mainstream machine translation technology, yet most SMT systems translate documents sentence by sentence under strict independence assumptions. This proposal focuses on two key problems in document-level SMT: one is to analyze what discourse-level information is lost in traditional SMT systems, and the other is how to control such loss in these systems. The main contents of this proposal include: 1) exploiting document-level knowledge that can improve SMT performance, and designing an effective control framework that preserves discourse semantics during translation; 2) exploring a reliable framework for document-level SMT. First, a baseline is built to perform document-level translation with cache technology. Then, a topic-sentence-driven SMT approach with consistency control is proposed; 3) exploring proper evaluation metrics for document-level SMT systems. Current evaluation metrics are suitable only for sentence-level SMT and cannot objectively reflect quality changes at the document level. Research on the development of corresponding evaluation metrics can thus actively promote the research
English keywords: document-level MT; document-level MT evaluation; summary-driven; selectional preference;