We study the problem of monitoring distributed systems in which computers communicate by message passing and share an almost synchronized clock. This is a realistic scenario for networks where monitoring is slow enough (at the human scale) to permit efficient clock synchronization, so that the clock deviation is small compared to the monitoring cycle. This is the case when monitoring human systems in wide-area networks or the Internet, including large deployments. More concretely, we study how to monitor decentralized systems where monitors are expressed as stream runtime verification specifications, under a timed asynchronous network. Our monitors communicate over the network, where messages can take arbitrarily long to arrive but cannot be duplicated or lost. This communication setting is common in many cyber-physical systems such as smart buildings and ambient living. Previous approaches to decentralized monitoring were limited to synchronous networks, which are not easily implemented in practice because of network failures. Even when network failures are unusual, they can require several monitoring cycles to be repaired. In this work we propose a solution to the timed asynchronous monitoring problem and show that this problem generalizes the synchronous case. We study the specifications and conditions on the network behavior that allow monitoring to take place with bounded resources, independently of the trace length. Finally, we report the results of an empirical evaluation of an implementation and verify the theoretical results in terms of effectiveness and efficiency.