Identifying a project-join view (PJ-view) over a collection of tables is the first step of many data management projects, e.g., assembling a dataset to feed into a business intelligence tool or creating a training dataset to fit a machine learning model. When table collections are large and lack join information, as when combining databases or working with data lakes, query-by-example (QBE) systems can help identify relevant data. However, they are designed under the assumption that join information is available in the schema, and they do not perform well on pathless table collections, i.e., collections without join path information. We present a reference architecture that explicitly divides the end-to-end problem of discovering PJ-views over pathless table collections into a human problem and a technical problem. We then present Niffler, a system built to address the technical problem. We introduce algorithms for the main components of Niffler, including a signal generation component that helps reduce the set of candidate views, which may be large due to errors and ambiguity in both the data and the input queries. We evaluate Niffler on real datasets to demonstrate the effectiveness of the new engine in discovering PJ-views over pathless table collections.