With data pipeline tools and the expressiveness of SQL, managing interdependent materialized views (MVs) are becoming increasingly easy. These MVs are updated repeatedly upon new data ingestion (e.g., daily), from which database admins can observe performance metrics (e.g., refresh time of each MV, size on disk) in a consistent way for different types of updates (full vs. incremental) and for different systems (single node, distributed, cloud-hosted). One missed opportunity is that existing data systems treat those MV updates as independent SQL statements without fully exploiting their dependency information and performance metrics. However, if we know that the result of a SQL statement will be consumed immediately after for subsequent operations, those subsequent operations do not have to wait until the early results are fully materialized on storage because the results are already readily available in memory. Of course, this may come at a cost because keeping results in memory (even temporarily) will reduce the amount of available memory; thus, our decision should be careful. In this paper, we introduce a new system, called S/C, which tackles this problem through efficient creation and update of a set of MVs with acyclic dependencies among them. S/C judiciously uses bounded memory to reduce end-to-end MV refresh time by short-circuiting expensive reads and writes; S/C's objective function accurately estimates time savings from keeping intermediate data in memory for particular periods. Our solution jointly optimizes an MV refresh order, what data to keep in memory, and when to release data from memory. At a high level, S/C still materializes all data exactly as defined in MV definitions; thus, it doesn't impact any service-level agreements. In our experiments with TPC-DS datasets (up to 1TB), we show S/C's optimization can speedup end-to-end runtime by 1.04x-5.08x with 1.6GB memory.
翻译:借助数据管道工具和SQL的表达能力,管理相互依赖的材料视图(MVs)变得越来越容易。这些MV在新数据摄入时(例如,每天)反复更新,数据库管理员可以观察不同类型的更新(全量和增量)和不同系统(单节点、分布式、云托管)的性能指标(例如,每个MV的刷新时间,磁盘空间占用)的一致方式。一个被忽视的机会是,现有的数据系统将那些MV更新视为独立的SQL语句,而没有充分利用它们的依赖信息和性能度量。然而,如果我们知道SQL语句的结果将立即被用于随后的操作,那么这些随后的操作就不必等待早期结果在存储器上完全实例化,因为结果已经在内存中准备好了。当然,这可能会有代价,因为在内存中(即使是暂时地)保留结果将减少可用内存;因此,我们的决策应该谨慎。在本文中,我们介绍了一个名为S/C的新系统。它通过有效地创建和更新一组彼此之间呈非循环依赖关系的MV来解决这个问题。S/C明智地使用有限内存,通过跳过昂贵的读写来减少端到端MV刷新时间;S/C的目标函数可以准确地估计在特定时间段内在内存中保留中间数据所节省的时间。我们的解决方案联合优化MV刷新顺序,什么数据保留在内存中以及何时释放数据。在高层次上,S/C仍然完全按照MV定义实例化所有数据;因此,它不会影响任何服务级别协议。在我们使用TPC-DS数据集(高达1TB)进行的实验中,我们展示了S/C优化可以在1.6GB内存的情况下将端到端运行时间加速1.04倍至5.08倍。