Identifying the failure modes of cloud computing systems is a difficult and time-consuming task, owing to the growing complexity of such systems and to the large volume and noisiness of failure data. This paper presents a novel approach for analyzing failure data from cloud systems that relieves human analysts of manually fine-tuning the data for feature engineering. The approach leverages Deep Embedded Clustering (DEC), a family of unsupervised clustering algorithms based on deep learning, which uses an autoencoder to optimize data dimensionality and inter-cluster variance. We applied the approach in the context of the OpenStack cloud computing platform, both on the raw failure data and in combination with an anomaly detection pre-processing algorithm. The results show that the performance of the proposed approach, in terms of cluster purity, is comparable to, and in some cases even better than, manually fine-tuned clustering, thus avoiding the need for deep domain knowledge and reducing the effort required to perform the analysis. In all cases, the proposed approach performs better than unsupervised clustering when no feature engineering is applied to the data. Moreover, the distribution of failure modes produced by the proposed approach is closer to the actual frequency of the failure modes.
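Since the results are reported in terms of cluster purity, a small illustration of that metric may help. The sketch below assumes the standard definition (the fraction of samples that carry the majority label of their cluster); the abstract does not state the exact formula used in the paper, and the function name `cluster_purity` and the failure-mode labels in the example are hypothetical.

```python
from collections import Counter


def cluster_purity(cluster_ids, true_labels):
    """Return the purity of a clustering (standard definition, assumed here).

    Purity = (sum over clusters of the size of the cluster's majority
    class) / (total number of samples). A value of 1.0 means every
    cluster contains samples of a single class.
    """
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have equal length")
    if len(cluster_ids) == 0:
        raise ValueError("at least one sample is required")

    # Count how often each true label occurs inside each cluster.
    label_counts = {}
    for cluster, label in zip(cluster_ids, true_labels):
        label_counts.setdefault(cluster, Counter())[label] += 1

    # Each cluster contributes the count of its most frequent label.
    majority_total = sum(max(c.values()) for c in label_counts.values())
    return majority_total / len(cluster_ids)


# Hypothetical example: six failure records, three clusters.
clusters = [0, 0, 0, 1, 1, 2]
labels = ["crash", "crash", "timeout", "timeout", "timeout", "crash"]
purity = cluster_purity(clusters, labels)
```

In this made-up example, five of the six records carry the majority label of their cluster, so the purity is 5/6. Purity tends to rise as the number of clusters grows, so comparisons are most meaningful between clusterings with a similar number of clusters.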