Intelligent fault-tolerant (FT) computing has recently demonstrated significant advantages in proactively predicting and diagnosing faults, thereby ensuring reliable service delivery. However, due to heterogeneous fault knowledge, dynamic workloads, and limited data support, existing deep learning-based FT algorithms face challenges in fault detection quality and training efficiency. This is primarily because they treat fault knowledge homogeneously, which makes it difficult to fully capture diverse and complex fault patterns. To address these challenges, we propose FT-MoE, a sustainable-learning fault-tolerant computing framework based on a dual-path architecture for high-accuracy fault detection and classification. The model employs a mixture-of-experts (MoE) architecture, enabling different parameters to learn distinct fault knowledge. Additionally, we adopt a two-stage learning scheme that combines comprehensive offline training with continual online tuning, allowing the model to adaptively optimize its parameters in response to evolving real-time workloads. To facilitate realistic evaluation, we construct a new fault detection and classification dataset for edge networks, comprising 10,000 intervals with fine-grained resource features, surpassing existing datasets in both scale and granularity. Finally, we conduct extensive experiments on the FT benchmark to verify the effectiveness of FT-MoE. The results demonstrate that our model outperforms state-of-the-art methods.
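To make the mixture-of-experts idea in the abstract concrete, the following is a minimal sketch of how a gate can route one interval of resource features to a few experts and blend their fault-class probabilities. It is not the FT-MoE implementation. The abstract does not specify the gating rule, the expert form, or the feature layout, so the top-k softmax gate, the linear experts, the random untrained weights, and every name used here (`ToyMoE`, `route`, `predict_proba`) are assumptions made purely for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, x):
    """Compute W x + b for a row-major weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


class ToyMoE:
    """Hypothetical top-k gated mixture of linear experts (untrained)."""

    def __init__(self, n_features, n_experts, n_classes, top_k=2, seed=0):
        rng = random.Random(seed)

        def rand_matrix(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                    for _ in range(rows)]

        self.top_k = top_k
        # The gate scores each expert for a given input.
        self.gate_w = rand_matrix(n_experts, n_features)
        self.gate_b = [0.0] * n_experts
        # Each expert owns separate parameters, so each can specialise.
        self.experts = [(rand_matrix(n_classes, n_features), [0.0] * n_classes)
                        for _ in range(n_experts)]

    def route(self, x):
        """Return (expert_index, weight) pairs for the top-k experts."""
        scores = linear(self.gate_w, self.gate_b, x)
        chosen = sorted(range(len(scores)),
                        key=lambda i: scores[i], reverse=True)[:self.top_k]
        # Renormalise the gate over the selected experts only.
        weights = softmax([scores[i] for i in chosen])
        return list(zip(chosen, weights))

    def predict_proba(self, x):
        """Blend the selected experts' class probabilities by gate weight."""
        n_classes = len(self.experts[0][1])
        mixed = [0.0] * n_classes
        for idx, weight in self.route(x):
            w, b = self.experts[idx]
            for c, p in enumerate(softmax(linear(w, b, x))):
                mixed[c] += weight * p
        return mixed


# One interval with four made-up resource features (e.g. CPU, memory, disk, network).
model = ToyMoE(n_features=4, n_experts=4, n_classes=3, top_k=2)
interval = [0.9, 0.2, 0.7, 0.1]
routing = model.route(interval)
probs = model.predict_proba(interval)
```

Because the gate weights are renormalised over the selected experts and each expert emits a probability distribution, the blended output is itself a distribution over fault classes. In a trained model, the gate and the experts would be learned jointly, which is what lets different parameters absorb different fault patterns.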