AI-based monitoring has become crucial for cloud-based services because of their scale. A common approach is to detect causal relationships among service components and build a causal graph. The availability of domain information makes cloud systems particularly well suited to such causal detection approaches. In modern cloud systems, however, auto-scalers dynamically change the number of microservice instances, and a load balancer manages the load on each instance. This poses a challenge for off-the-shelf causal structure detection techniques, as they neither incorporate domain information about the system architecture nor provide a way to model compute distributed across a varying number of service instances. To address this, we develop CausIL, which detects a causal structure among service metrics by considering compute distributed across dynamic instances and incorporating domain knowledge derived from the system architecture. For application in cloud systems, CausIL estimates a causal graph using instance-specific variations in performance metrics, modeling multiple instances of a service as independent, conditional on system assumptions. A simulation study shows the efficacy of CausIL over baselines, improving graph estimation accuracy by ~25% as measured by Structural Hamming Distance, while a real-world dataset demonstrates CausIL's applicability in deployment settings.
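For readers unfamiliar with the evaluation metric, the following is a minimal sketch of how Structural Hamming Distance (SHD) between a ground-truth and an estimated causal graph might be computed. It is not taken from CausIL. The function name, the metric names in the example graphs, and the convention of counting a reversed edge as a single error are all assumptions made for illustration; some formulations count a reversal as two errors, which the `count_reversal_once` flag switches to.

```python
from typing import Iterable, Tuple

Edge = Tuple[str, str]  # directed edge (cause, effect)


def structural_hamming_distance(
    true_edges: Iterable[Edge],
    est_edges: Iterable[Edge],
    count_reversal_once: bool = True,
) -> int:
    """Count edge additions, deletions, and reversals separating two directed graphs.

    Hypothetical helper for illustration; the reversal convention is an assumption.
    """
    true_set = set(true_edges)
    est_set = set(est_edges)

    missing = true_set - est_set  # edges in the truth that the estimate lacks
    extra = est_set - true_set    # edges in the estimate that are not in the truth

    if not count_reversal_once:
        return len(missing) + len(extra)

    # A reversed edge shows up once in `missing` and once in `extra`;
    # subtract one so each reversal contributes a single error.
    reversed_edges = {(a, b) for (a, b) in missing if (b, a) in extra}
    return len(missing) + len(extra) - len(reversed_edges)


# Hypothetical service-metric graphs (names are illustrative only).
true_graph = {("load", "cpu"), ("cpu", "latency"), ("latency", "errors")}
est_graph = {("load", "cpu"), ("latency", "cpu"), ("load", "errors")}

shd = structural_hamming_distance(true_graph, est_graph)
shd_double = structural_hamming_distance(true_graph, est_graph, count_reversal_once=False)
print(shd, shd_double)
```

In this example the estimate reverses one edge, misses one, and adds one spurious edge. A lower SHD means the estimated graph is structurally closer to the ground truth.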