Modern large-scale computing systems split jobs into multiple smaller tasks that execute in parallel, which accelerates job completion and reduces energy consumption. However, a common performance problem in such systems is dealing with straggler tasks: slow-running instances that increase the overall response time. Such tasks can significantly degrade the system's Quality of Service (QoS) and lead to violations of Service Level Agreements (SLAs). To combat this issue, there is a need for automatic straggler detection and mitigation mechanisms that execute jobs without violating the SLA. Prior work typically builds reactive models that first detect and only then mitigate straggler tasks, which leads to delays. Other works use prediction-based proactive mechanisms, but ignore heterogeneous host characteristics or volatile task characteristics. In this paper, we propose a Straggler Prediction and Mitigation Technique (START) that predicts which tasks might become stragglers and dynamically adapts scheduling to achieve lower response times. Our technique analyzes all tasks and hosts based on their compute and network resource consumption using an Encoder Long Short-Term Memory (LSTM) network. The output of this network is then used to predict and mitigate expected straggler tasks. This reduces the SLA violation rate and execution time without compromising QoS. Specifically, we use the CloudSim toolkit to simulate START in a cloud environment and compare it with state-of-the-art techniques (IGRU-SD, SGC, Dolly, GRASS, NearestFit and Wrangler) in terms of QoS parameters such as energy consumption, execution time, resource contention, CPU utilization and SLA violation rate. Experiments show that START reduces execution time, resource contention, energy consumption and SLA violations by 13%, 11%, 16% and 19%, respectively, compared to the state-of-the-art approaches.