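Project title: Key Techniques for Uyghur Unit-Set Optimization and Their Application in Speech Recognition (维吾尔语单元集优化关键技术研究及其在语音识别中的应用)
Project number: No.61462085
Project type: Regional Science Fund Project
Approval year: 2015
Project discipline: Computer Science
Project author: 米吉提·阿不里米提
Author affiliation: Xinjiang University (新疆大学)
Funding amount: CNY 450,000 (45万元)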
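Chinese abstract: In agglutinative languages such as Uyghur, words are long, so in practical natural language processing applications the lexicon grows explosively, which substantially reduces the efficiency of statistical models. Existing research, in China and abroad, mainly optimizes on the basis of one particular unit using simple statistical properties such as co-occurrence frequency. It has not adequately considered the hierarchical structure of linguistic units or the differences between text and speech, and therefore has not achieved optimal results. This research will build statistical models based on unit sets at multiple levels, compare their efficiency, and select the best intermediate unit set. Having completed several unit-set optimization methods, it will explore the statistical and linguistic causes that affect efficiency, and implement a new machine learning method that automatically balances the various types of units, together with a high-performance automatic unit extraction method, thereby improving model efficiency and prediction accuracy. Using improved prediction accuracy as the evaluation criterion, it will optimize the linguistic and acoustic unit sets of a large-vocabulary Uyghur speech recognition system. The hierarchical morphological structure of Uyghur provides the point of innovation and breakthrough for this research. The approach can be extended to other agglutinative languages and other research areas. Unit selection is a fundamental problem in natural language processing research, with major research significance and prospects for further study.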
Chinese keywords: natural language processing;speech recognition;Uyghur;artificial intelligence;corpus
English abstract: In agglutinative languages like Uyghur, words have a variety of derivatives, which increases the vocabulary size explosively and causes OOV (out-of-vocabulary) and data sparseness problems. This significantly decreases the efficiency of NLP (natural language processing) systems. Existing unit optimization research is based on a specific unit set and uses simple statistical properties such as co-occurrence frequency, without considering linguistic properties such as the multilayer structure of unit sets or the differences between text and speech. It is therefore difficult to obtain optimal results. In this research, we will build statistical models based on multilayer unit sets and compare their efficiency in order to select the preferred unit set among the linguistic layers. On the basis of unit optimization that exploits the characteristics of the various unit sets, we will investigate the underlying statistical and linguistic factors that affect model efficiency. We will also implement a novel machine learning approach that can balance between the layers, together with statistical models that automatically extract the optimal unit sets, thereby improving model efficiency and prediction accuracy. Furthermore, these unit optimization methods will be applied to Uyghur LVCSR (large vocabulary continuous speech recognition) for both linguistic and acoustic units. The clear multilayer morphological structure of the Uyghur language provides a new point of breakthrough for this research. The proposed idea can be conveniently applied to other agglutinative languages and other NLP domains. Unit selection is a fundamental problem in NLP research, so this project has considerable potential for future research.
English keywords: natural language processing;speech recognition;Uyghur;artificial intelligence;corpus