Research on Adaptive Structured Classification Methods for Multi-Domain Web Text Data

Project title: Research on Adaptive Structured Classification Methods for Multi-Domain Web Text Data

Project number: No.61300063

Project type: Young Scientists Fund

Year of initiation/approval: 2014

Discipline: Automation Technology, Computer Technology

Project author: Sun Xu (孙栩)

Author affiliation: Peking University

Funding amount: 250,000 RMB

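Abstract (translated from Chinese): Web text data come from many different domains, forming a highly domain-diverse text collection that poses new challenges for natural language processing. Existing structured classification techniques lack cross-domain adaptive learning ability on such domain-diverse web text. To address this problem, this project proposes to study the characteristics of multi-domain web text data and to develop structured learning methods with adaptive capability. The main research topics are as follows: (1) Treating each domain as a task related to the other domains, we study multi-task learning techniques for processing multi-domain web text data. This method adaptively learns the degree of relatedness between domains, handling cross-domain web text by automatically modeling domain correlations. (2) Treating domain characteristics as latent information of the task, we study the ability of conditional latent variable models to process multi-domain data adaptively. This method models the latent information of the domains in a unified way, so that multi-domain data can be fused automatically to achieve efficient structured classification. (3) Both multi-task learning and conditional latent variable models face high complexity and slow speed when processing multi-domain web text; we study fast optimization algorithms to address this problem.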

Keywords (translated from Chinese): web text; multi-domain data; adaptive; structured classification

English abstract: Web texts come from a large number of domains with diverse characteristics, which poses a new challenge for natural language processing. Existing structured classification methods lack adaptive learning ability on cross-domain web text data. To deal with this issue, this project aims to study the characteristics of multi-domain web texts and to propose adaptive structured learning methods based on multi-domain data. The major research points are as follows: (1) We treat each domain as a task that relates to other domains/tasks, and study multi-task learning methods for modeling multi-domain web texts. This method learns domain similarities adaptively and processes cross-domain web texts automatically by modeling the relationships among domains. (2) We treat domain-specific characteristics as hidden information, and study latent conditional models for adaptively processing multi-domain data. This method models the hidden information of different domains in a unified way, so that multi-domain data can be combined automatically to improve the performance of structured classification tasks. (3) Multi-task learning methods and latent conditional models both face high complexity and low efficiency when modeling multi-domain web texts. We stud

English keywords: web text; multiple domain data; adaptive; structured classification
