Heterogeneous information networks (HINs) represent different types of entities and relationships between them. Exploring, analysing, and extracting knowledge from such networks relies on metapath queries that identify pairs of entities connected by relationships of diverse semantics. While the real-time evaluation of metapath query workloads on large, web-scale HINs is highly demanding in computational cost, current approaches do not exploit interrelationships among the queries. In this paper, we present ATRAPOS, a new approach for the real-time evaluation of metapath query workloads that leverages a combination of efficient sparse matrix multiplication and intermediate result caching. ATRAPOS selects intermediate results to cache and reuse by detecting frequent sub-metapaths among workload queries in real time, using a tailor-made data structure, the Overlap Tree, and an associated caching policy. Our experimental study on real data shows that ATRAPOS accelerates exploratory data analysis and mining on HINs, outperforming off-the-shelf caching approaches and state-of-the-art research prototypes in all examined scenarios.
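To make the core idea concrete, the following minimal sketch shows how a metapath query can be evaluated as a chain of sparse matrix multiplications, and how caching intermediate products lets a later query with a shared sub-metapath skip work. It is an illustration under stated assumptions, not the ATRAPOS implementation. The toy author/paper/venue network, the names `ADJ`, `spmm`, and `MetapathEvaluator`, and the "cache every prefix" policy are all hypothetical. In particular, the naive prefix cache stands in for the Overlap Tree and its caching policy, which are not reproduced here.

```python
from collections import defaultdict

# Hypothetical toy HIN with entity types A (author), P (paper), V (venue).
# Each relation is a sparse adjacency matrix stored as {row: {col: count}}.
ADJ = {
    ("A", "P"): {"a1": {"p1": 1, "p2": 1}, "a2": {"p2": 1}},
    ("P", "V"): {"p1": {"v1": 1}, "p2": {"v1": 1, "v2": 1}},
}


def transpose(m):
    """Return the transpose of a dict-of-dicts sparse matrix."""
    t = defaultdict(dict)
    for r, row in m.items():
        for c, v in row.items():
            t[c][r] = v
    return dict(t)


# Reverse relations are the transposes of the forward ones.
ADJ[("P", "A")] = transpose(ADJ[("A", "P")])
ADJ[("V", "P")] = transpose(ADJ[("P", "V")])


def spmm(x, y):
    """Multiply two sparse matrices, visiting only non-zero entries."""
    out = {}
    for r, row in x.items():
        acc = defaultdict(int)
        for k, v in row.items():
            for c, w in y.get(k, {}).items():
                acc[c] += v * w
        if acc:
            out[r] = dict(acc)
    return out


class MetapathEvaluator:
    """Evaluates metapaths left to right and caches every prefix result.

    Assumption: this naive policy is a stand-in for illustration only;
    it never evicts and only reuses prefixes, not arbitrary sub-metapaths.
    """

    def __init__(self):
        self.cache = {}           # metapath prefix -> intermediate result
        self.multiplications = 0  # counts sparse products actually computed

    def evaluate(self, metapath):
        """Evaluate a metapath given as a string of type symbols, e.g. 'APV'."""
        if len(metapath) < 2:
            raise ValueError("a metapath needs at least two entity types")
        # Default starting point: the adjacency matrix of the first relation.
        start = 2
        result = ADJ[(metapath[0], metapath[1])]
        # Prefer the longest prefix whose result is already cached.
        for end in range(len(metapath), 2, -1):
            if metapath[:end] in self.cache:
                result = self.cache[metapath[:end]]
                start = end
                break
        # Extend the result one relation at a time, caching each new prefix.
        for i in range(start, len(metapath)):
            result = spmm(result, ADJ[(metapath[i - 1], metapath[i])])
            self.multiplications += 1
            self.cache[metapath[: i + 1]] = result
        return result
```

In this sketch, evaluating `"APV"` costs one multiplication, and a subsequent `"APVP"` query reuses the cached `"APV"` result and needs only one more. Each entry of a result counts the metapath instances connecting the corresponding pair of entities.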