Reliability management in cloud service systems is challenging because failures cascade across services. Error wrapping, a prevalent practice in modern microservice development, enriches an error with context at each layer of the function call stack, building an error chain that describes a failure from its technical origin to its business impact. However, wrapping also creates a significant traceability problem: it is difficult to recover the complete error propagation path from the final log message back to its source. Existing approaches do not address this problem effectively. To fill this gap, we present ErrorPrism, which automatically reconstructs error propagation paths in production microservice systems. ErrorPrism first performs static analysis on service code repositories to build a function call graph and map log strings to candidate functions, which significantly reduces the path search space for subsequent analysis. It then employs an LLM agent to perform an iterative backward search that accurately reconstructs the complete, multi-hop error path. Evaluated on 67 production microservices at ByteDance, ErrorPrism achieves 97.0% accuracy in reconstructing paths for 102 real-world errors, outperforming existing static analysis and LLM-based approaches. ErrorPrism thus provides an effective and practical tool for root cause analysis in industrial microservice systems.