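Project title: Key Technologies for Materializing Massive Web User-Generated Content
Project number: No.61462017
Project type: Regional Science Fund Project
Year of approval: 2015
Discipline: Automation technology, computer technology
Project author: Yang Qing (杨青)
Author affiliation: Guilin University of Electronic Technology (桂林电子科技大学)
Funding: 450,000 CNY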
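Chinese abstract: The Web has evolved into a user-centric ecosystem, and user-generated content has become its main content. The effective extraction and integrated management (materialization) of Web user-generated content has become a key step in converting Web data into Web value. This project targets user-generated content with inconsistent schemas across a large number of heterogeneous sites, and studies the key technologies involved in materializing it from Web pages to local storage. To address the challenging problem of automated and adaptive extraction of user-generated content, the project focuses on extraction-rule learning techniques that combine transfer learning with Bayesian logical deduction, providing automatic and adaptive extraction solutions for different environments. To address the diverse presentation of user-generated content and the common access requirements that arise in its analysis and application, the project studies distributed data storage models and indexing techniques that take user ID, timeline, and similar attributes as basic reference dimensions, tackling difficulties such as storage and access optimization for massive user-generated content. In addition, a prototype system for materializing Web user-generated content will be built, and extensive, intensive experiments will be conducted to verify the system's efficiency. By studying these key technologies, the project aims to establish a unified management platform for user-generated content and to help improve the efficiency of converting data into value.
Chinese keywords: user-generated content extraction; incremental Web data extraction; massive data management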
English abstract: The Web has evolved into a user-centric ecosystem in which user-generated content has become the main content. The extraction and integrated management of user-generated content, called materialization of user-generated content, has become a key link in converting Web data into Web value. This project focuses on the varied user-generated content found across a large number of heterogeneous sites and studies the key technologies related to its materialization. First, to address the challenging problem of automated and adaptive extraction of user-generated content, the project will put forward an original extraction method that discovers extraction rules by integrating transfer learning and Bayesian logical deduction. The proposed method will provide a solution for automated and adaptive extraction of user-generated content in different contexts. Second, based on a full analysis of both the diversity of user-generated content and the access requirements specific to its analysis and application, the project will study distributed storage models and indexing technologies that give full consideration to user ID, timeline, and other access dimensions. Finally, the project will integrate these technologies into a prototype system that realizes the materialization of Web user-generated content. The system will also be used to carry out a wide range of intensive experiments that verify its effectiveness and efficiency. Building on these key technologies, the project aims to provide a unified data management platform for massive Web user-generated content, which should improve the efficiency of converting data into value.
English keywords: user-generated content extraction; incremental Web data extraction; massive data management