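Project title: Research on Methods for Identifying the Identity Attributes of Online Users Based on Linguistic Features
Project number: No.61272275
Project type: General Program
Year of approval: 2013
Discipline: Automation technology; computer technology
Principal investigator: Qian Tieyun (钱铁云)
Affiliation: Wuhan University (武汉大学)
Funding: 800,000 CNY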
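Chinese abstract (translated): The registration profiles of online users often contain a large amount of incomplete or false information, which greatly hinders key applications such as cybercrime prevention, public opinion supervision, and commercial promotion. Existing studies are mostly limited to age and authorship identification and focus mainly on English. This project will study methods for identifying the identity attributes of online users: by analyzing the linguistic features (especially those of Chinese) of the documents users post, it will distinguish multiple attributes, including gender, age group, place of origin, profession, and authorship. First, we plan to study data optimization measures to eliminate the adverse effects of class skew and of noise that takes the form of popular topics in user-posted documents. Second, we plan to study efficient feature selection algorithms and a method for expanding labeled features from a small number of seeds, to address the huge number of linguistic features and the difficulty of labeling in the online setting. Finally, we plan to study methods for distinguishing user identity attributes, in order to gain a systematic understanding of linguistic features, classification tasks, classification algorithms, and evaluation criteria, and to build a new algorithmic framework for online authorship attribution, where authors are numerous and the test domain is open. This research is expected to reveal the intrinsic relationship between personal linguistic features and identity attributes, and to provide a theoretical and technical foundation for the important application of distinguishing online user identities.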
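Chinese keywords (translated): user identity attributes; linguistic features; discrimination algorithms; similarity space; co-training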
English abstract: Online users often provide incomplete or fake information to hide their true identities. This greatly hinders critical applications such as cybercrime prevention, opinion supervision, and targeted promotion. It would be very useful if user profiles could be checked on the basis of text analysis. Current research focuses mainly on age and authorship attribution based on English corpora. This project will recognize a user's gender, age, profession, birthplace, and authorship using linguistic features, especially Chinese features, extracted from user-posted documents. First, we will present data cleansing methods to eliminate noise arising from common events and the problem of skewed classes. Second, we will develop approaches for feature selection and feature set expansion to deal with the huge number of features and the high cost of manual labeling. Last, we will systematically compare the performance of existing classification methods to gain insight into linguistic features, classification tasks, classification algorithms, and evaluation metrics. Meanwhile, we will propose a new framework for authorship attribution of online users. This study will reveal the relationship between personal linguistic features and identit
English keywords: users' identity attributes; linguistic features; classification methods; similarity space; co-training and tri-training