Moving scientific computation from high-performance computing (HPC) and cloud computing (CC) environments to devices on the edge, where data can be collected by streamlined computing devices that are physically located near instruments of interest, has received tremendous interest in recent years. Such edge computing environments can operate on data in situ instead of requiring the collection of data in HPC and/or CC facilities, offering enticing benefits that include avoiding the costs of transmission over potentially unreliable or slow networks, increased data privacy, and real-time data analysis. Before such benefits can be realized at scale, new fault-tolerant approaches must be developed to address the inherent unreliability of edge computing environments, because the traditional resilience approaches used by HPC and CC are not generally applicable to edge computing. Those traditional approaches commonly rely on checkpoint-and-restart and/or redundant-computation strategies, which are not feasible in edge computing environments where data storage is limited and synchronization is expensive. Motivated by prior algorithm-based fault tolerance approaches, a variant of the asynchronous Jacobi (ASJ) method is developed herein that achieves resilience to data corruption by leveraging existing convergence theory. The proposed ASJ variant rejects a solution approximation received from a neighbor device if the distance between two successive approximations violates an analytic bound. Numerical results show that the ASJ variant restores convergence in the presence of certain types of natural and malicious data corruption.
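The rejection rule described above can be illustrated with a small sketch. It is not the paper's algorithm or its bound: the abstract does not state the analytic bound, so the threshold `tau` below is an assumed stand-in, valid only for strictly diagonally dominant systems, where the Jacobi iteration contracts in the max norm with factor `rho`. The function name `jacobi_guarded`, the `corrupt` hook, and the test system are all hypothetical, and the sweep-based loop only mimics asynchrony through stale neighbor copies.

```python
def jacobi_guarded(A, b, x0, sweeps, corrupt=None, guard=True):
    """Sweep-based Jacobi in which each device may reject neighbor values.

    Assumed bound (not from the paper): with rho = max_i sum_{j!=i}|a_ij|/|a_ii| < 1,
    the initial error satisfies ||x0 - x*||_inf <= ||x1 - x0||_inf / (1 - rho),
    so two honest successive values of a component differ by at most
    tau = 2 * ||x1 - x0||_inf / (1 - rho).
    """
    n = len(b)
    rho = max(sum(abs(A[i][j]) for j in range(n) if j != i) / abs(A[i][i])
              for i in range(n))
    assert rho < 1.0, "sketch assumes strict diagonal dominance"
    # One clean Jacobi step from x0, used only to compute the threshold.
    x1 = [(b[i] - sum(A[i][j] * x0[j] for j in range(n) if j != i)) / A[i][i]
          for i in range(n)]
    tau = 2.0 * max(abs(p - q) for p, q in zip(x1, x0)) / (1.0 - rho)

    x = list(x0)
    # seen[i][j]: last value of x_j that device i accepted from neighbor j.
    seen = [{j: x0[j] for j in range(n) if j != i and A[i][j] != 0}
            for i in range(n)]
    rejected = 0
    for k in range(sweeps):
        for i in range(n):
            for j in seen[i]:
                # Message from device j to device i, possibly corrupted in transit.
                v = x[j] if corrupt is None else corrupt(k, i, j, x[j])
                if guard and abs(v - seen[i][j]) > tau:
                    rejected += 1      # keep the stale copy instead
                    continue
                seen[i][j] = v
        # Every device relaxes its own unknown using its accepted copies.
        x = [(b[i] - sum(A[i][j] * seen[i][j] for j in seen[i])) / A[i][i]
             for i in range(n)]
    return x, rejected


# Hypothetical test problem: tridiagonal system whose exact solution is all ones.
A = [[4, -1, 0, 0, 0],
     [-1, 4, -1, 0, 0],
     [0, -1, 4, -1, 0],
     [0, 0, -1, 4, -1],
     [0, 0, 0, -1, 4]]
b = [3, 2, 2, 2, 3]
x0 = [0.0] * 5

def corrupt(k, i, j, value):
    # Every third sweep, the message from device 2 to device 1 is corrupted.
    return value + 1e6 if (k % 3 == 0 and i == 1 and j == 2) else value

x_guarded, n_rejected = jacobi_guarded(A, b, x0, 60, corrupt, guard=True)
x_plain, _ = jacobi_guarded(A, b, x0, 60, corrupt, guard=False)
err_guarded = max(abs(v - 1.0) for v in x_guarded)
err_plain = max(abs(v - 1.0) for v in x_plain)
print(err_guarded, err_plain, n_rejected)
```

In this toy setting the guarded run falls back on a stale neighbor value whenever a message fails the check, which is the kind of delayed update that asynchronous iterations tolerate, while the unguarded run absorbs each corrupted value into its iterate. A corruption small enough to pass the check would not be caught by this threshold, consistent with the abstract's claim covering only certain types of corruption.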