Today's large-scale services (e.g., video streaming platforms, data centers, sensor grids) need diverse real-time summary statistics across multiple subpopulations of multidimensional datasets. However, state-of-the-art frameworks do not offer general and accurate analytics in real time at reasonable costs. The root cause is the combinatorial explosion of data subpopulations and the diversity of summary statistics we need to monitor simultaneously. We present Hydra, an efficient framework for multidimensional analytics that presents a novel combination of using a ``sketch of sketches'' to avoid the overhead of monitoring exponentially-many subpopulations and universal sketching to ensure accurate estimates for multiple statistics. We build Hydra as an Apache Spark plugin and address practical system challenges to minimize overheads at scale. Across multiple real-world and synthetic multidimensional datasets, we show that Hydra can achieve robust error bounds and is an order of magnitude more efficient in terms of operational cost and memory footprint than existing frameworks (e.g., Spark, Druid) while ensuring interactive estimation times.
翻译:今天的大型服务(例如视频流流平台、数据中心、传感器网格)需要多种多层多层数据集的实时实时简要统计数据,然而,最新框架不能以合理的成本实时提供一般和准确的分析分析。 根本原因是数据子群的组合爆炸以及我们同时需要监测的汇总统计数据的多样性。 我们提出了九头蛇,这是一个多层面分析的有效框架,它提供了一种新型的组合,即使用“素描”来避免监测指数数子群和通用素描,以确保对多种统计数据作出准确的估计。我们把海德拉建成一个阿帕奇火花插件,并应对实际的系统挑战,以在规模上尽量减少间接费用。在多个现实世界和合成的多层面数据集中,我们表明海德拉可以实现强力的错误界限,并且比现有框架(例如斯帕奇、德鲁伊德)在业务成本和记忆足迹方面的效率更高,同时确保互动的估计时间。