System-level virtualization introduces critical vulnerabilities to failures of the software components that implement virtualization: the virtualization infrastructure (VI). To mitigate the impact of such failures, we introduce a resilient VI (RVI) that can recover individual VI components from failures caused by hardware or software faults, transparently to the hosted virtual machines (VMs). Much of the focus is on the ReHype mechanism for recovery from hypervisor failures, which can lead to state corruption and to inconsistencies among the states of system components. ReHype's implementation for the Xen hypervisor was done incrementally, using fault injection results to identify sources of critical corruption and inconsistencies. The implementation involved 900 LOC, with a memory space overhead of 2.1MB. Fault injection campaigns, with a variety of fault types, show that ReHype can successfully recover, in less than 750ms, from over 88% of detected hypervisor failures. In addition to ReHype, recovery mechanisms for the other VI components are described. The overall effectiveness of our RVI is evaluated by hosting a Web service application on a cluster of VMs. With faults in any VI component, for over 87% of detected failures, our recovery mechanisms allow the services provided by the application to be continuously maintained despite the resulting failures of VI components.