Modern cloud computing systems contain hundreds to thousands of computing and storage servers. This scale, combined with ever-growing system complexity, poses a key challenge to failure and resource management for dependable cloud computing. Autonomic failure detection is a crucial technique for understanding emergent, cloud-wide phenomena and for self-managing cloud resources to assure system-level dependability. Detecting failures requires monitoring cloud execution and collecting runtime performance data. These data are usually unlabeled, so a prior failure history is not always available in production clouds. In this paper, we present a \emph{self-evolving anomaly detection} (SEAD) framework for cloud dependability assurance. The framework self-evolves by recursively exploring newly verified anomaly records and continuously updating the anomaly detector online. A distinct advantage of the framework is that cloud system administrators need to check only a small number of detected anomalies, and their decisions are used to update the detector. The detector thus evolves as system hardware is upgraded, the software stack is updated, and user workloads change. Moreover, we design two types of detectors, one for general anomaly detection and the other for type-specific anomaly detection. With the help of self-evolving techniques, our detectors achieve on average a sensitivity of 88.94\% and a specificity of 94.60\%, which makes them suitable for real-world deployment.
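
The abstract reports detector quality as sensitivity and specificity. As a reminder of what these two figures measure, the following is a minimal sketch using the standard definitions: sensitivity is the fraction of true anomalies that are flagged, and specificity is the fraction of normal records that are left unflagged. The function name `sensitivity_specificity` and the toy labels are illustrative assumptions and are not taken from the paper.

```python
import math


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for boolean labels, True = anomaly.

    Hypothetical helper for illustration; not the paper's evaluation code.
    """
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth and pred:
            tp += 1          # anomaly correctly flagged
        elif truth and not pred:
            fn += 1          # anomaly missed
        elif not truth and pred:
            fp += 1          # normal record falsely flagged
        else:
            tn += 1          # normal record correctly left alone
    # Guard against empty classes; NaN signals "undefined" rather than 0.
    sensitivity = tp / (tp + fn) if (tp + fn) else math.nan
    specificity = tn / (tn + fp) if (tn + fp) else math.nan
    return sensitivity, specificity


# Toy data (made up): 4 true anomalies, 6 normal records.
y_true = [True, True, True, True, False, False, False, False, False, False]
y_pred = [True, True, True, False, False, False, False, False, False, True]
sens, spec = sensitivity_specificity(y_true, y_pred)
```

In this toy case three of four anomalies are caught and five of six normal records are left unflagged. High specificity matters for a framework like the one described because every false alarm is a record an administrator would have to inspect.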