Microservice-based systems underpin modern distributed computing environments but remain vulnerable to partial failures, cascading timeouts, and inconsistent recovery behavior. Although numerous resilience and recovery patterns have been proposed, existing surveys are largely descriptive and lack systematic evidence synthesis or quantitative rigor. This paper presents a PRISMA-aligned systematic literature review of empirical studies on microservice recovery strategies published between 2014 and 2025 and indexed in IEEE Xplore, the ACM Digital Library, and Scopus. From an initial corpus of 412 records, 26 high-quality studies were selected using transparent inclusion, exclusion, and quality-assessment criteria. The review identifies nine recurring resilience themes, including circuit breakers, retries with jitter and budgets, sagas with compensation, idempotency, bulkheads, adaptive backpressure, observability, and chaos validation. As a data-oriented contribution, the paper introduces a Recovery Pattern Taxonomy, a Resilience Evaluation Score checklist for standardized benchmarking, and a constraint-aware decision matrix that maps latency, consistency, and cost trade-offs to appropriate recovery mechanisms. The results consolidate fragmented resilience research into a structured, analyzable evidence base that supports reproducible evaluation and informed design of fault-tolerant, performance-aware microservice systems.