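Project title: Research on Key Technologies for Mining Tibetan Text Resources on the Internet and Extracting Corpora
Project number: No. 61202219
Project type: Young Scientists Fund
Year of approval: 2013
Discipline: Computer Science
Project author: Liu Huidan (刘汇丹)
Affiliation: Institute of Software, Chinese Academy of Sciences
Funding: 230,000 yuan (RMB)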
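Chinese abstract (translated): Tibetan information processing currently suffers from a shortage of basic corpora. The Internet offers a large volume of Tibetan text and is an important source of Tibetan corpus material. This project will use web crawling and automatic Tibetan encoding identification to mine Tibetan resources automatically from the vast resources of the Internet and, supported by manual analysis, will examine how Tibetan text resources are distributed and in what forms they exist, identifying those worth exploiting. We will build a prototype Tibetan search engine that indexes Tibetan resources on the Internet effectively, making it easier to mine web resources that contain predefined patterns. We will study fully automatic article extraction from Tibetan web pages and automatic discovery of Chinese-Tibetan bilingual parallel corpora, and will automatically collect Tibetan article corpora and Chinese-Tibetan parallel corpora. The project will build a URL database of Tibetan text resources, a Tibetan article corpus, an Internet Tibetan word (phrase) database, and a Chinese-Tibetan bilingual parallel corpus. On the basis of large-scale Tibetan corpora, it will also compute word-frequency statistics and train Tibetan language models, providing basic resources for research on Tibetan information processing.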
Chinese keywords (translated): Tibetan; corpus; data mining; Tibetan word segmentation; part-of-speech tagging
English abstract: Tibetan information processing currently suffers from a lack of basic corpora. The Internet provides a large number of Tibetan text resources and is an important source of Tibetan corpus material. In this project, we will first mine Tibetan text resources automatically from the vast resources of the Internet, using a web crawler and automatic Tibetan encoding recognition. By analyzing those resources, we will gain a comprehensive understanding of the distribution and forms of Tibetan text resources on the Internet, and of where and how they can be used in Tibetan natural language processing tasks. Second, we will build a Tibetan search engine and index those Tibetan text resources effectively. With it, we can check whether any Tibetan text resources matching a predefined pattern exist on the Internet. Then, we will study techniques for automatically extracting Tibetan news and articles, including their title, author, time, content, and other information. Automatic detection of Chinese-Tibetan parallel text is another of our interests; we will use the Tibetan search engine and a Chinese-Tibetan dictionary to realize it. In addition, applying all those technologies, we will build many Tibetan-related corpora, such as
English keywords: Tibetan; Corpus; Data mining; Tibetan word segmentation; Part-of-speech tagging