Microservice-based architectures enable different aspects of web applications to be created and updated independently, even after deployment. Associated technologies such as service mesh provide application-level fault resilience through attribute configurations that govern the behavior of request-response services, and the interactions among them, in the presence of failures. While this provides tremendous flexibility, the configured values of these attributes, and the relationships among them, can significantly affect the performance and fault resilience of the overall application. Furthermore, it is impossible to determine the best and worst combinations of attribute values with respect to fault resiliency via testing alone, due to the complexity of the underlying distributed system and the large number of possible attribute value combinations. In this paper, we present a model-based reinforcement learning workflow for service mesh fault resiliency. Our approach enables the prediction of the most significant fault resilience behaviors at the web-application level, extending from single-service to aggregated multi-service management through efficient agent collaboration.