Reliability of complex Cyber-Physical Systems is necessary to guarantee the availability and/or safety of the services they provide. Diverse and complex fault tolerance policies are adopted to enhance reliability. They combine redundancy and dynamic reconfiguration to address hardware reliability with specific software reliability techniques such as diversity or software rejuvenation. These policies call for flexible runtime health checks of system executions that go beyond conventional runtime monitoring of pre-programmed health conditions, in part to minimize maintenance costs. Defining a suitable monitoring model for such checks in complex systems remains a challenge. In this paper we propose a novel approach, Reliability Based Monitoring (RBM), for flexible runtime monitoring of reliability in complex systems. RBM exploits a hierarchical reliability model that is periodically applied to runtime diagnostics data, which makes it possible to dynamically plan maintenance activities aimed at preventing failures. As a proof of concept, we show how to apply RBM to a 2oo3 software system implementing different fault-tolerant policies.
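To make the idea of periodically applying a reliability model to a 2oo3 (two-out-of-three) system concrete, the following is a minimal sketch and not the RBM model proposed in the paper. It assumes independent components with constant failure rates (an exponential reliability law), and a hypothetical reliability threshold below which maintenance would be scheduled. All function names and parameters are illustrative.

```python
import math


def component_reliability(failure_rate, mission_time):
    """Reliability of one component, assuming a constant failure rate
    (exponential law). This is an assumption of the sketch."""
    return math.exp(-failure_rate * mission_time)


def two_out_of_three(r1, r2, r3):
    """Reliability of a 2oo3 arrangement of independent components:
    the system works if at least two of the three components work."""
    return (
        r1 * r2 * r3
        + r1 * r2 * (1 - r3)
        + r1 * (1 - r2) * r3
        + (1 - r1) * r2 * r3
    )


def needs_maintenance(failure_rates, mission_time, threshold):
    """Hypothetical periodic check: estimate system reliability from the
    current per-component failure rates (e.g. derived from diagnostics)
    and flag maintenance when it drops below the threshold."""
    reliabilities = [
        component_reliability(rate, mission_time) for rate in failure_rates
    ]
    return two_out_of_three(*reliabilities) < threshold
```

With three identical components of reliability 0.9, `two_out_of_three` evaluates to approximately 0.972, which matches the closed form 3r² − 2r³ for identical independent components. In a monitoring loop, `needs_maintenance` would be re-evaluated each time the diagnostics data updates the estimated failure rates.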