基于Spark的大图数据最优子模式匹配查询方法研究

项目名称： 基于Spark的大图数据最优子模式匹配查询方法研究

项目编号： No.61502258

项目类型： 青年科学基金项目

立项/批准年度： 2016

项目学科： 自动化技术、计算机技术

项目作者： 彭云

作者单位： 齐鲁工业大学

项目金额： 20万元

中文摘要： 近年来，包含海量节点的大图数据得到了广泛的应用，如知识图谱和社交网络等。模式匹配查询是这些大图数据上应用非常广泛的查询方式。但现有的模式匹配查询方法在查询的查全率、多种模式约束的整合、多查询优化和分布式求解方法的可扩展性方面还存在一些不足。因此本项目提出基于新一代分布式计算框架Spark且支持多种模式约束和多查询优化的最优子模式匹配查询方法。该方法能在保证查准率的同时提高查全率；支持多种模式约束，有较广的应用范围；利用多个查询之间的共性进行优化，在多查询并发时整体的查询处理效率较高；基于更适合模式匹配计算的Spark框架，其分布式求解方法有较高的可扩展性。本项目不仅能为当前的大图数据应用提供一种有效且快速的查询方法，而且能进一步丰富大图数据模式匹配的理论，具有重要的理论和现实意义。本项目的研究成果还能为大图数据检索、挖掘等相关领域的实践与应用提供理论指导和方法借鉴。

中文关键词： 大图数据；子模式匹配查询；多查询优化；Spark框架

英文摘要： Recently, massive graph-structured data with billions of nodes has been adopted in several applications, such as knowledge graphs and social networks, etc. Pattern matching query is one of the most widely used approach to retrieve these graphs. Given a pattern graph Q and a data graph G, pattern matching query is to find a subgraph of G satisfying the pattern constrains of Q. ..However, existing pattern matching techniques have at least four technical challenges to effectively and efficiently retrieve the data graphs. Firstly, these queries require that all nodes of Q must match to nodes of G. But, such complete matching is often unachievable due to the great data incompleteness in G. Secondly, most existing works study pattarn matching query with a certain pattern constrain, which can only support a certain application scenario. They cannot efficiently support the applications with diverse scenarios. Thirdly, there are often many queries submitted to be processed concurrently in real applications, and the queries usually have common expressions. But, most existing works just focus on processing a single query. It is under-optimized for the real applications. Fourthly, most existing distributed pattern matching methods are based on the traditional distributed frameworks, such as MapReduce and Vertex-oriented framework. But, these frameworks do not suit the characteristics of pattern matching computation and hence the methods cannot scale to the ever-growing massive graphs. ..To address these challenges, this project studies the optimal subpattern matching query with multi-query optimization using Spark. The novelties of this research are as follows. Firstly, the optimal subpattern matching query allows incomplete matching and therefore is of higher effectiveness on the data graphs having great data incompleteness. Secondly, this project integrates the processing of multiple pattern constrains. Therefore, our technique can efficiently support the applications with diverse scenarios. Thirdly, this project studies multi-query optimization techniques, which can significantly improve the performance of the query processing in real applications. Fourthly, this project adopts the recently proposed and well-accepted Spark framework. It is more suitable to the characteristics of pattern matching computation and can achieve higher scalability. ..The research results of this project can provide an effective and efficient way for retrieving real massive data graphs, and have great theoretical and practical significance. The results can also guide the practices and applications of pattern matching techniques to massive graph retrieval and mining.

英文关键词： Massive graph;Subpattern matching query;Multi-query optimization;Spark framework

成为VIP会员查看完整内容