A data lake is a repository of data with potential for future analysis. However, both discovering what data a data lake contains and exploring related data sets can take significant effort, as a data lake can hold a daunting amount of heterogeneous data. In this paper, we propose the use of schema inference to support the interpretation of the data in a data lake. If a data lake is to support a schema-on-read paradigm, understanding the existing schema of the relevant portions of the data lake seems to be a prerequisite. We make use of approximate indexes that can be used for data discovery to inform the inference of a schema for a data lake, consisting of entity types and the relationships between them. The approach identifies candidate entity types by clustering similar data sets from the data lake, and then uses relationships between data sets in different clusters to inform the identification of relationships between the entity types. The approach is evaluated on real-world data repositories, to identify where the proposal is effective and to inform the identification of areas for further work.
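The two steps described above (cluster similar data sets into candidate entity types, then lift data-set-level relationships to the entity-type level) can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes each data set is summarised only by its set of attribute names, it uses exact Jaccard similarity with single-link clustering as a stand-in for the approximate indexes, and it assumes the links between data sets are already known. All names (`cluster_datasets`, `infer_relationships`, the sample data sets, and the `threshold` value) are hypothetical.

```python
from collections import Counter
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_datasets(schemas, threshold=0.5):
    """Group data sets whose attribute-name sets are similar.

    Single-link clustering via union-find: two data sets share a cluster
    if a chain of pairwise similarities >= threshold connects them.
    Returns a mapping from data set name to a cluster representative.
    """
    parent = {name: name for name in schemas}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(schemas, 2):
        if jaccard(schemas[a], schemas[b]) >= threshold:
            parent[find(a)] = find(b)

    return {name: find(name) for name in schemas}


def infer_relationships(cluster_of, dataset_links):
    """Count data-set links that cross clusters, per pair of clusters."""
    counts = Counter()
    for a, b in dataset_links:
        ca, cb = cluster_of[a], cluster_of[b]
        if ca != cb:  # only cross-cluster links relate entity types
            counts[frozenset((ca, cb))] += 1
    return counts


# Hypothetical toy data lake: attribute names per data set.
schemas = {
    "customers_2020": {"customer_id", "name", "email"},
    "customers_2021": {"customer_id", "name", "email", "phone"},
    "orders_q1": {"order_id", "customer_id", "total"},
    "orders_q2": {"order_id", "customer_id", "total", "currency"},
}
# Assumed to come from some upstream discovery step (e.g. joinable columns).
links = [("customers_2020", "orders_q1"), ("customers_2021", "orders_q2")]

cluster_of = cluster_datasets(schemas)
relationships = infer_relationships(cluster_of, links)
```

In this toy example the two customer data sets fall into one cluster and the two order data sets into another, and both links support a single relationship between the two candidate entity types. A realistic data lake would likely need approximate similarity search rather than the all-pairs comparison used here, since the pairwise loop grows quadratically with the number of data sets.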