AI-based monitoring has become crucial for cloud-based services because of the scale at which they operate. A common approach to AI-based monitoring is to detect causal relationships among service components and build a causal graph. The availability of domain information makes cloud systems even better suited to such causal detection approaches. In modern cloud systems, however, auto-scalers dynamically change the number of microservice instances, and a load balancer manages the load on each instance. This poses a challenge for off-the-shelf causal structure detection techniques, as they neither incorporate domain information about the system architecture nor provide a way to model compute distributed across a varying number of service instances. To address this, we develop CausIL, which detects a causal structure among service metrics by considering compute distributed across dynamic instances and incorporating domain knowledge derived from the system architecture. For application in cloud systems, CausIL estimates a causal graph using instance-specific variations in performance metrics, modeling multiple instances of a service as independent, conditional on system assumptions. A simulation study shows the efficacy of CausIL over baselines, improving graph estimation accuracy by ~25% as measured by Structural Hamming Distance, while a real-world dataset demonstrates CausIL's applicability in deployment settings.
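The abstract reports accuracy in terms of Structural Hamming Distance (SHD). As a rough illustration of that metric only, and not of CausIL itself, the sketch below counts the node pairs on which a true and an estimated directed graph disagree. It assumes one common SHD convention, in which a missing edge, an extra edge, and a reversed edge each count as one error; the convention used in the paper's evaluation may differ. The function name and the metric names in the example are hypothetical.

```python
def structural_hamming_distance(true_edges, est_edges):
    """Count node pairs whose edge status differs between two directed graphs.

    Assumed convention: a missing, extra, or reversed edge each adds 1.
    Edges are (source, target) tuples; self-loops are not supported.
    """
    true_edges = set(true_edges)
    est_edges = set(est_edges)
    for u, v in true_edges | est_edges:
        if u == v:
            raise ValueError(f"self-loop on {u!r} is not supported")

    # Work on unordered pairs so that a reversed edge is counted once.
    pairs = {frozenset(edge) for edge in true_edges | est_edges}

    distance = 0
    for pair in pairs:
        u, v = tuple(pair)
        true_state = ((u, v) in true_edges, (v, u) in true_edges)
        est_state = ((u, v) in est_edges, (v, u) in est_edges)
        if true_state != est_state:
            distance += 1
    return distance


# Hypothetical service metrics, used only to exercise the function.
true_graph = {("load", "cpu"), ("cpu", "latency"), ("latency", "errors")}
estimated_graph = {("load", "cpu"), ("latency", "cpu"), ("load", "errors")}

# One reversed edge, one missing edge, and one extra edge.
shd = structural_hamming_distance(true_graph, estimated_graph)
print(shd)
```

A lower value means the estimated graph is structurally closer to the reference graph, and identical graphs give a distance of 0.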