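Project title: Research on a Real-Time Stream Data Processing Platform for Microblogs and Key Techniques for Query Processing
Project number: No.61272092
Project type: General Program
Year of approval: 2013
Discipline: Automation technology; computer technology
Project author: Yu Xiaohui (禹晓辉)
Affiliation: Shandong University (山东大学)
Funding: 800,000 yuan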
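Chinese abstract (translated): Microblogging is a new form of social media. On a microblogging platform, users can post information in short text, sharing information in real time and communicating with one another. Microblogging platforms, with Twitter as the leading example, have grown rapidly over the past few years and have produced large amounts of data; Twitter, for example, adds 8 TB of new data per day. How to find microblogs or users of interest in data at this scale is a problem that must be solved. Starting from the needs of practical applications, and targeting the characteristics of microblog data (large volume, high update frequency, strong real-time requirements, and rich user information), this project studies two aspects: the underlying computing platform and the query processing built on top of it. The expected contributions are: (1) an architecture and key techniques for a real-time stream data processing platform for microblogs, achieving high performance, high availability, and high reliability in data processing; (2) new ranking strategies for microblog search results, together with index structures and algorithms for spatially sensitive queries, enabling efficient real-time query processing; (3) a formulation of the problem of similar-user search based on microblog content, with index structures and methods for incremental computation of results, enabling more precise and efficient user recommendation.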
Chinese keywords (translated): microblog; search; social media; data management; real-time processing
English abstract: Microblogging is an emerging form of social media. On a microblogging platform, users publish information as very short texts, which enables real-time information sharing and communication between users. Microblogging platforms, most notably Twitter, have grown tremendously in the past few years and generate a huge volume of data; the data volume at Twitter is estimated to increase at a rate of 8 TB per day. It is therefore critical to be able to retrieve microblogs or users of interest from this vast amount of data. This project addresses the problem from two aspects: the computation infrastructure and query processing. We expect to make the following contributions: (1) we propose the system architecture of a real-time stream processing platform and a set of key techniques to achieve high performance, high availability, and high reliability; (2) we develop new ranking strategies for microblog search and, for spatially sensitive queries, index structures and query processing methods that support real-time processing; and (3) we define the problem of similarity search over users based on the contents of their microblogs, and propose index structures and methods for incrementally updating search results, for more effective user recommendation.
English keywords: microblog; search; social media; data management; real-time processing