This work studies the problem of constructing a representative workload from a given input analytical query workload where the former serves as an approximation with guarantees of the latter. We discuss our work in the context of workload analysis and monitoring. As an example, evolving system usage patterns in a database system can cause load imbalance and performance regressions which can be controlled by monitoring system usage patterns, i.e.,~a representative workload, over time. To construct such a workload in a principled manner, we formalize the notions of workload {\em representativity} and {\em coverage}. These metrics capture the intuition that the distribution of features in a compressed workload should match a target distribution, increasing representativity, and include common queries as well as outliers, increasing coverage. We show that solving this problem optimally is NP-hard and present a novel greedy algorithm that provides approximation guarantees. We compare our techniques to established algorithms in this problem space such as sampling and clustering, and demonstrate advantages and key trade-offs of our techniques.
翻译:这项工作研究从特定投入分析查询工作量中确定代表工作量的问题,前者与后者的保障相近。我们在工作量分析和监测的范围内讨论我们的工作。例如,数据库系统中不断演变的系统使用模式可能会造成负荷不平衡和性能倒退,而这种不平衡和性能倒退可由监测系统使用模式,即代表工作量,在一段时间内由监测系统使用模式加以控制。为了以有原则的方式构建这种工作量,我们正式确定工作量(代表工作量)和覆盖面的概念。这些衡量尺度反映了一种直觉,即压缩工作量的特点的分配应该与目标分配相符,增加代表性,包括共同查询以及外部查询,扩大覆盖范围。我们表明,最理想地解决这个问题的方法是NP-硬的,是一种新的贪婪算法,提供近似保证。我们将我们的方法与在诸如抽样和集群等问题空间的既定算法进行比较,并展示我们技术的优势和关键权衡。