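Project title: Research on Key Technologies for Uyghur, Kazakh, and Kyrgyz Cross-Language Content Filtering
Project number: No.61262062
Project type: Regional Science Fund Project
Year of approval: 2013
Discipline: Automation technology, computer technology
Project author: Turdi Tohti (吐尔地·托合提)
Affiliation: Xinjiang University (新疆大学)
Funding: 460,000 yuan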
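Chinese abstract (translated): In a language environment completely different from Chinese and English, how to filter Uyghur, Kazakh, and Kyrgyz multilingual text effectively has become a scientific problem in urgent need of a solution: on the one hand, richer and more useful information must be obtained from a larger multilingual resource space; on the other, the cross-language spread of harmful information must be thoroughly blocked. This project will take multilingual (Uyghur, Kazakh, Kyrgyz), multi-pattern (multiple scripts, multiple encodings) text as the object of content filtering, and will introduce Uyghur, Kazakh, and Kyrgyz natural language processing techniques and the cross-language techniques of information retrieval into content filtering. It will propose a cross-language model for multilingual, multi-pattern adaptive content filtering and, around the key theories and techniques of that model, carry out exploratory and innovative research in four areas: multilingual (Uyghur, Kazakh, Kyrgyz), multi-pattern (multiple scripts, multiple encodings) text conversion; text representation (feature extraction, feature selection); adaptive filtering decision (profile initialization, threshold setting, relevance judgment); and adaptive filtering learning (profile learning, threshold learning). The ultimate aim is to establish the theoretical system and technical foundation for Uyghur, Kazakh, and Kyrgyz cross-language content filtering and, by developing the relevant algorithms, tools, and experimental platforms, to validate the project's results through application in fields related to multilingual information processing for minority languages.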
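Chinese keywords (translated): one-language multi-script conversion; semantic word segmentation; text filtering; word vectors; multi-pattern matching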
English abstract: In a language environment completely different from Chinese and English, how to filter Uyghur, Kazakh and Kyrgyz multilingual text effectively, so as to obtain richer and more useful information from a large space of multilingual resources on the one hand, and to thoroughly resist the cross-language spread of harmful information on the other, has become a scientific problem to be solved. This project will take multilingual (Uyghur, Kazakh, Kyrgyz), multi-pattern (multiple character sets, multiple encodings) text as the object of content filtering, introduce Uyghur, Kazakh and Kyrgyz natural language processing and cross-language information retrieval technology into content filtering, and present a multilingual, multi-pattern cross-language model for adaptive content filtering. Around the key theory and technology of this model, it will carry out exploratory and innovative research in four areas: multilingual (Uyghur, Kazakh, Kyrgyz), multi-pattern (multiple character sets, multiple encodings) text conversion; text representation (feature extraction, feature selection); adaptive filtering judgment (profile initialization, threshold initialization, relevance judgment); and adaptive learning (profile learning, threshold learning), and finally establish Uyghur, Kazakh and Kyrgyz multilingual multi-pattern conte
English keywords: Multilanguage conversion; semantic segmentation; text filtering; word embedding; multi-pattern matching