This work is motivated by two key facts. First, it is highly desirable to be able to learn and perform knowledge discovery and analytics (LKD) tasks without accessing raw-data tables, either because organizations find it increasingly frustrating and costly to manage and maintain ever-growing tables, or for privacy reasons. Compact models can therefore be built from the raw data and used in place of the tables. Second, LKD tasks often must be performed on a (potentially very large) table that is itself the result of joining separate (potentially very large) relational tables. But how can one do this when the individual to-be-joined tables are absent? We pose the following fundamental questions. Q1: How can one "join models" of (absent/deleted) tables, or "join models with other tables", in a way that enables LKD as if it were performed on the join of the actual raw tables? Q2: What are appropriate models to use per table? Q3: As the model join is an approximation of the actual data join, how can one evaluate the quality of the model join result? This work puts forth a framework, Model Join, that addresses these challenges. The framework integrates and joins the per-table models of the absent tables and generates a uniform and independent sample that is a high-quality approximation of a uniform and independent sample of the actual raw-data join. Any approximation error stems from the per-table models, not from the Model Join framework itself. The sample obtained by Model Join can be used for downstream LKD tasks such as approximate query processing, classification, clustering, regression, association rule mining, and visualization. To our knowledge, this is the first work with this agenda and solutions. Detailed experiments with TPC-DS data and synthetic data showcase Model Join's usefulness.