Run-time Failure Detection via Non-intrusive Event Analysis in a Large-Scale Cloud Computing Platform

Cloud computing systems fail in complex and unforeseen ways due to unexpected combinations of events and interactions among hardware and software components. These failures are especially problematic when they are silent, i.e., not accompanied by any explicit failure notification, which hinders timely detection and recovery. In this work, we propose an approach to run-time failure detection tailored to multi-tenant and concurrent cloud computing systems. The approach uses a non-intrusive form of event tracing, which requires no manual changes to the system's internals to propagate session identifiers (IDs), and builds a set of lightweight monitoring rules from fault-free executions. We evaluated the effectiveness of the approach in detecting failures in the OpenStack cloud computing platform, a complex, "off-the-shelf" distributed system, by executing a campaign of fault injection experiments in a multi-tenant scenario. Our experiments show that the approach detects failures with an F1 score (0.85) and accuracy (0.77) higher than those of the OpenStack failure logging mechanisms (0.53 and 0.50) and of two non-session-aware run-time verification approaches (both lower than 0.15). Moreover, the approach significantly decreases the average time to detect failures at run-time (~114 seconds) compared to the OpenStack logging mechanisms.
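To make the idea of building monitoring rules from fault-free executions more concrete, the following is a minimal Python sketch and not the implementation evaluated in this work. It assumes a simplified rule form ("every occurrence of event A is followed by event B within a time window"), a single non-concurrent trace per execution, and hypothetical event names such as `api.create_server`; it does not attempt to handle the multi-tenant interleaving that the approach itself targets.

```python
from typing import List, Set, Tuple

# An event is (timestamp in seconds, event name). Names are hypothetical.
Event = Tuple[float, str]
Rule = Tuple[str, str]  # (a, b): "a is followed by b within the window"


def followers(ordered: List[Event], i: int, window: float) -> Set[str]:
    """Names of events that occur after ordered[i] within `window` seconds."""
    t_a, a = ordered[i]
    return {b for t_b, b in ordered[i + 1:] if t_b - t_a <= window and b != a}


def learn_rules(traces: List[List[Event]], window: float) -> Set[Rule]:
    """Keep (a, b) only if b follows EVERY occurrence of a in all fault-free traces."""
    always_after = {}
    for trace in traces:
        ordered = sorted(trace)
        for i, (_, a) in enumerate(ordered):
            seen = followers(ordered, i, window)
            if a in always_after:
                always_after[a] &= seen  # drop pairs that fail for this occurrence
            else:
                always_after[a] = seen
    return {(a, b) for a, names in always_after.items() for b in names}


def check_trace(trace: List[Event], rules: Set[Rule], window: float):
    """Return (timestamp, a, missing_b) for each rule violated in the trace."""
    ordered = sorted(trace)
    violations = []
    for i, (t_a, a) in enumerate(ordered):
        seen = followers(ordered, i, window)
        for rule_a, rule_b in sorted(rules):
            if rule_a == a and rule_b not in seen:
                violations.append((t_a, a, rule_b))
    return violations


# Illustrative, made-up traces (not data from the experiments).
fault_free = [
    [(0.0, "api.create_server"), (1.0, "scheduler.select_host"), (2.5, "compute.spawn")],
    [(0.0, "api.create_server"), (0.8, "scheduler.select_host"), (3.0, "compute.spawn")],
]
rules = learn_rules(fault_free, window=5.0)

# A silent failure: the request is scheduled but the instance never spawns.
silent = [(0.0, "api.create_server"), (1.0, "scheduler.select_host")]
for violation in check_trace(silent, rules, window=5.0):
    print(violation)
```

In this toy setting a missing follow-up event is reported even though no error message was ever logged, which is the sense in which such rules can expose silent failures. A real monitor would additionally need to decide which concurrent events belong to the same request, which is the problem that the non-intrusive, session-ID-free tracing is meant to address.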
