We tackle the problem of predicting the performance of MapReduce applications by designing accurate progress indicators, which keep programmers informed of the percentage of computation time completed during the execution of a job. Through extensive experiments, we show that state-of-the-art progress indicators (including the one provided by Hadoop) can be seriously harmed by data skew, load imbalance, and straggling tasks. This is mainly due to their implicit assumption that running time depends linearly on input size. We thus design a novel profile-guided progress indicator, called NearestFit, that operates without the linearity assumption and exploits a careful combination of nearest neighbor regression and statistical curve fitting techniques. Our theoretical progress model requires fine-grained profile data, which can be very difficult to manage in practice. To overcome this issue, we compute accurate approximations of some of the quantities used in our model through space- and time-efficient data streaming algorithms. We implemented NearestFit on top of Hadoop 2.6.0. An extensive empirical assessment on the Amazon EC2 platform over a variety of real-world benchmarks shows that NearestFit is practical with respect to space and time overheads and that its accuracy is generally very good, even in scenarios where competitors incur non-negligible errors and wide prediction fluctuations. Overall, NearestFit significantly improves the current state of the art in progress analysis for MapReduce.
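To make the idea of combining nearest neighbor regression with curve fitting concrete, the following is a minimal sketch and not the NearestFit implementation. It assumes that profile data is a list of (input size, running time) pairs from completed tasks, that a power law t = a * n^b is an acceptable fallback curve, and that pending tasks run one after another. The function names, the `radius` cutoff, and the choice of a power law are all hypothetical and chosen only for illustration.

```python
import math

def knn_predict(samples, n, k=3, radius=None):
    """Average running time of the k samples whose input size is closest to n.
    Returns None if no such sample lies within `radius` of n."""
    near = sorted(samples, key=lambda s: abs(s[0] - n))[:k]
    if radius is not None:
        near = [s for s in near if abs(s[0] - n) <= radius]
    if not near:
        return None
    return sum(t for _, t in near) / len(near)

def fit_power_law(samples):
    """Least-squares fit of t = a * n**b, done as a line fit in log-log space.
    Assumes positive sizes and times and at least two distinct sizes."""
    xs = [math.log(n) for n, _ in samples]
    ys = [math.log(t) for _, t in samples]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(my - b * mx)
    return a, b

def predict_runtime(samples, n, k=3, radius=None):
    """Use nearby observations when available, else extrapolate with the curve."""
    estimate = knn_predict(samples, n, k, radius)
    if estimate is not None:
        return estimate
    a, b = fit_power_law(samples)
    return a * n ** b

def progress(elapsed, pending_sizes, samples, radius=None):
    """Fraction of total time completed, assuming (for simplicity) that
    pending tasks execute sequentially."""
    remaining = sum(predict_runtime(samples, n, radius=radius) for n in pending_sizes)
    return elapsed / (elapsed + remaining)

# Hypothetical profile in which running time grows quadratically with input size.
profile = [(1, 1.0), (2, 4.0), (4, 16.0), (8, 64.0)]
print(predict_runtime(profile, 2, k=1, radius=1))   # close to an observed size
print(predict_runtime(profile, 16, k=2, radius=1))  # far from any observation
print(progress(85.0, [16], profile, radius=1))
```

In this sketch a skewed task (size 16, far larger than anything observed) is handled by the fitted curve rather than by a linear extrapolation, which is the kind of situation where an indicator that assumes linear scaling would overestimate progress.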