Deployment, operation and maintenance of large IT systems are becoming increasingly complex and put human experts under extreme stress when problems occur. Machine learning (ML) and artificial intelligence (AI) are therefore applied to IT system operation and maintenance, an approach summarized under the term AIOps. One specific direction aims at recognizing recurring anomaly types to enable automated remediation. However, IT system specific properties, especially frequent changes (e.g. software updates, reconfiguration or hardware modernization), make the recognition of recurring anomaly types challenging. Existing methods mostly assume that the dimensionality of the provided data is static. We propose a method that is invariant to changes in the dimensionality of the given data. Resource metric data such as CPU utilization, allocated memory and others are modelled as multivariate time series. The extraction of temporal and spatial features, together with the subsequent anomaly classification, is realized by TELESTO, our novel graph convolutional neural network (GCNN) architecture. The experimental evaluation is conducted in a real-world cloud testbed deployment that hosts two applications. Classification results for anomalies injected on a Cassandra database node show that TELESTO outperforms the alternative GCNNs and achieves an overall classification accuracy of 85.1%. Classification results for the other nodes show accuracy values between 60% and 85%.
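To illustrate what invariance to dimensionality changes can mean in practice, the following is a minimal sketch and not the TELESTO architecture itself. It assumes that each resource metric is a graph node whose features are the values in a time window, that the graph is fully connected, and that node embeddings are mean-pooled. The names `normalized_adjacency`, `embed_window` and `w_feat` are hypothetical and chosen for this example only.

```python
import numpy as np


def normalized_adjacency(n_nodes):
    """Fully connected graph with self-loops, symmetrically normalized (assumed topology)."""
    adjacency = np.ones((n_nodes, n_nodes))
    degree = adjacency.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    return adjacency * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def embed_window(window, w_feat):
    """Map a (n_metrics, n_steps) window to a vector whose size does not depend on n_metrics."""
    a_hat = normalized_adjacency(window.shape[0])
    # One graph-convolution layer with ReLU: aggregate over nodes, then project the time axis.
    hidden = np.maximum(a_hat @ window @ w_feat, 0.0)
    # Mean pooling over nodes removes the dependence on the number of metrics.
    return hidden.mean(axis=0)


rng = np.random.default_rng(0)
n_steps, n_hidden = 20, 8
w_feat = rng.normal(size=(n_steps, n_hidden))  # shared weights, independent of metric count

emb_5 = embed_window(rng.normal(size=(5, n_steps)), w_feat)  # window with 5 metrics
emb_9 = embed_window(rng.normal(size=(9, n_steps)), w_feat)  # window with 9 metrics
print(emb_5.shape, emb_9.shape)
```

Because the weights act only on the time axis and the pooling collapses the node axis, the same parameters apply when metrics are added or removed, for example after a software update. A classifier placed on top of the pooled vector would therefore not need to be rebuilt when the number of metrics changes.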