Unstructured data, such as images and videos, is growing rapidly. Interconnected unstructured data, which is growing at an unprecedented rate, can be modeled as a graph whose node properties hold the unstructured data. In many real-world applications, end users query graph data and unstructured data together. Several systems and techniques have been proposed to meet this demand. However, most previous work executes the different tasks in separate systems and therefore loses the opportunity to optimize such queries within a single engine. In this work, we build a native graph database, PandaDB, to support querying unstructured data in a graph. We first introduce CypherPlus, a query language that lets users express complex graph queries over the semantics of unstructured data. Next, we develop a cost model and related query optimization techniques to speed up both unstructured data processing and graph query processing. In addition, we optimize data storage and indexing to speed up query processing in a distributed setting. PandaDB extends the implementation of the graph database Neo4j and is available as an open-source version for commercial use in the cloud. The results show that PandaDB supports large-scale query processing over unstructured data in a graph, e.g., more than a billion unstructured data items. We also share best practices learned while deploying the system in real applications.
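To illustrate why a cost model matters when graph predicates and unstructured-data predicates appear in the same query, the following is a minimal sketch and not PandaDB's actual optimizer. It assumes independent filters and applies the textbook rank metric (selectivity - 1) / cost to decide evaluation order. The names `Predicate`, `label_filter`, and `face_match`, as well as the cost and selectivity values, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Predicate:
    name: str
    cost: float         # assumed per-row evaluation cost (arbitrary units)
    selectivity: float  # assumed fraction of rows that pass, in [0, 1]


def rank(p: Predicate) -> float:
    # Textbook ordering metric for independent filters: lower rank runs first.
    return (p.selectivity - 1.0) / p.cost


def order_predicates(preds: List[Predicate]) -> List[Predicate]:
    # Sort so that cheap, highly selective predicates are evaluated early.
    return sorted(preds, key=rank)


def expected_cost(preds: List[Predicate]) -> float:
    # Expected per-row cost when predicates are applied in the given order:
    # each predicate is only paid for by rows that survived the earlier ones.
    total, surviving = 0.0, 1.0
    for p in preds:
        total += surviving * p.cost
        surviving *= p.selectivity
    return total


# Hypothetical predicates: a cheap structured-property filter and an
# expensive semantic predicate that would require model inference on an image.
label_filter = Predicate("label_filter", cost=1.0, selectivity=0.1)
face_match = Predicate("face_match", cost=1000.0, selectivity=0.5)

naive = [face_match, label_filter]
ordered = order_predicates(naive)

print([p.name for p in ordered])
print(expected_cost(naive), expected_cost(ordered))
```

Under these assumed numbers, evaluating the cheap graph-side filter first reduces the expected per-row cost by roughly an order of magnitude, which is the kind of gain that is unavailable when the two predicates run in separate systems.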