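Project title: Transfer Learning of Large-Scale Heterogeneous Features in Machine Translation
Project number: No.61300115
Project type: Young Scientists Fund project
Approval year: 2014
Discipline: Automation technology, computer technology
Principal investigator: Liu Yupeng (刘宇鹏)
Institution: Harbin University of Science and Technology (哈尔滨理工大学)
Funding: 230,000 CNY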
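Chinese abstract (translated): Conventional machine translation system combination is an important means of improving machine translation performance, but conventional combination models do not define an integrated model, nor do they account for the effect that differences among machine translation systems have on combination or for the limitations of conventional training methods. Drawing on the theoretical foundation of transfer learning, this project starts from its two basic problems (task and domain) and divides the causes of these differences (heterogeneous features) into heterogeneous machine translation systems/heterogeneous labeling systems (from the task perspective) and heterogeneous corpora (from the domain perspective). It also adopts a large-scale feature training algorithm, which overcomes the limit that conventional training methods place on the number of features. The project first studies the definition and efficiency of the integrated model. For heterogeneous machine translation systems/labeling systems, it performs large-scale feature-based/parameter-based combination. For training on heterogeneous corpora, it selects common features and adds them to the machine translation systems before combination. In addition, the study of heterogeneous machine translation systems gives a better understanding of the strengths and weaknesses of each type of machine translation, and the study of heterogeneous labeling systems and heterogeneous corpora gives a better understanding of their effect on machine translation systems.
Chinese keywords (translated): Transfer learning; heterogeneous features; phrase/rule embedding; domain transfer; deep recursion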
English abstract: Conventional system combination is an important way of improving machine translation performance, but it does not consider the underlying causes of system diversity or the limitations of conventional training methods, and it does not provide an integrated framework. Following the two fundamental problems of transfer learning, task and domain, we divide the causes of heterogeneous features into heterogeneous labeling systems/machine translation systems (from the task side) and heterogeneous training corpora (from the domain side). We use online training as the large-scale heterogeneous feature training method because minimum error rate training is sensitive to the number of features. The content of the project is as follows: 1) research on the integrated model of machine translation and on pruning techniques for the model; 2) transfer learning for heterogeneous machine translation/labeling systems; 3) transfer learning for heterogeneous training corpora. The research on heterogeneous machine translation systems gives a better understanding of the advantages and disadvantages of each type of machine translation. The research on heterogeneous labeling systems and training corpora gives a better understanding of their impact on machine translation systems.
English keywords: Transfer Learning; Heterogeneous Feature; Phrase/Rule Embedding; Domain Transfer; Deep Recursion