Data analysis often involves comparing subsets of data across many dimensions to find unusual trends and patterns. While such comparisons can be expressed in SQL, the resulting queries tend to be complex to write and perform poorly on large, high-dimensional datasets. In this paper, we propose COMPARE, a new logical operator for relational databases that concisely captures the enumeration and comparison of subsets of data and greatly simplifies the expression of a large class of comparative queries. We extend the database engine with optimization techniques that exploit the semantics of COMPARE to significantly improve the performance of such queries. We have implemented these extensions inside Microsoft SQL Server, a commercial DBMS engine. Our extensive evaluation on synthetic and real-world datasets shows that COMPARE yields a significant speedup over existing approaches, including physical plans generated by today's database systems, user-defined functions (UDFs), and middleware solutions that compare subsets outside the database.