Graph entity dependencies (GEDs) are novel graph constraints, unifying keys and functional dependencies, for property graphs. They have been found useful in many real-world data quality and data management tasks, including fact checking on social media networks and entity resolution. In this paper, we study the discovery problem of GEDs -- finding a minimal cover of valid GEDs in a given graph data. We formalise the problem, and propose an effective and efficient approach to overcome major bottlenecks in GED discovery. In particular, we leverage existing graph partitioning algorithms to enable fast GED-scope discovery, and employ effective pruning strategies over the prohibitively large space of candidate dependencies. Furthermore, we define an interestingness measure for GEDs based on the minimum description length principle, to score and rank the mined cover set of GEDs. Finally, we demonstrate the scalability and effectiveness of our GED discovery approach through extensive experiments on real-world benchmark graph data sets; and present the usefulness of the discovered rules in different downstream data quality management applications.
翻译:图形实体依赖度(GEDs)是新颖的图形限制,对财产图表来说,是关键和功能依赖度(GEDs),在很多真实世界的数据质量和数据管理任务中,包括社会媒体网络和实体分辨率的实况检查,这些都非常有用。在本文件中,我们研究了GEDs的发现问题 -- -- 在一个特定图表数据中找到一个有效的GED的最小覆盖面。我们把问题正规化,并提出了克服GED发现方面重大瓶颈的有效和高效方法。特别是,我们利用现有的图形分割算法,使GED范围能够快速发现,并在候选人依赖性极高的大空间上采用有效的调整战略。此外,我们还根据最低描述长度原则,为GEDs定出了一个有趣的计量标准,以对GEDs所埋的覆盖物进行评分和排序。最后,我们通过对现实世界基准图表数据集的广泛试验,展示了我们GED发现方法的可扩展性和有效性;我们介绍了所发现规则在不同下游数据质量管理应用中的有用性。