Distributed in-memory data processing engines accelerate iterative applications by caching substantial datasets in memory rather than recomputing them in each iteration. Selecting a suitable cluster size for caching these datasets plays an essential role in achieving optimal performance. In practice, this is a tedious and difficult task for end users, who are typically not aware of cluster specifications, workload semantics, and the sizes of intermediate data. We present Blink, an autonomous sampling-based framework, which predicts the sizes of cached datasets and selects the optimal cluster size without relying on historical runs. We evaluate Blink on a variety of iterative, real-world, machine learning applications. With sample runs costing on average 4.6% of the cost of the optimal runs, Blink selects the optimal cluster size in 15 out of 16 cases, saving up to 47.4% of execution cost compared to average costs.
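To make the idea of sampling-based prediction concrete, the sketch below illustrates one plausible approach under strong simplifying assumptions: run the application on small input fractions, fit a least-squares line relating input fraction to observed cached-dataset size, extrapolate to the full input, and pick the smallest cluster whose aggregate memory fits the prediction. The function names, the linear-scaling assumption, and the memory-only selection rule are illustrative and are not taken from Blink's actual implementation.

```python
from typing import List, Tuple
import statistics


def predict_cached_size(samples: List[Tuple[float, float]],
                        full_fraction: float = 1.0) -> float:
    """Fit a least-squares line (cached size vs. input fraction) to sample
    runs and extrapolate the cached dataset size for the full input.
    Assumes, for illustration, that cached size scales linearly with input."""
    xs = [f for f, _ in samples]
    ys = [s for _, s in samples]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in samples)
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope * full_fraction + intercept


def select_cluster_size(predicted_gb: float,
                        memory_per_node_gb: float,
                        max_nodes: int) -> int:
    """Pick the smallest node count whose aggregate memory holds the
    predicted cached data (per-node overheads ignored for simplicity)."""
    for nodes in range(1, max_nodes + 1):
        if nodes * memory_per_node_gb >= predicted_gb:
            return nodes
    return max_nodes


if __name__ == "__main__":
    # Hypothetical sample runs on 1%, 2%, and 5% of the input,
    # with observed cached-dataset sizes in GB.
    sample_runs = [(0.01, 0.9), (0.02, 1.7), (0.05, 4.2)]
    predicted = predict_cached_size(sample_runs)
    print(f"Predicted cached dataset size: {predicted:.1f} GB")
    print(f"Selected cluster size: "
          f"{select_cluster_size(predicted, memory_per_node_gb=16, max_nodes=32)} nodes")
```

In this toy example, the cost of the sample runs is small because they touch only a few percent of the input, which mirrors the abstract's point that sampling overhead (4.6% on average) is low relative to the cost of the optimal runs.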