Enterprise cloud developers have to build applications that are resilient to failures and interruptions. We advocate for, formalize, implement, and evaluate a simple yet effective fault-tolerant programming model for the cloud based on actors, reliable message delivery, and retry orchestration. Our model guarantees that (1) failed actor invocations are retried until success, (2) in a distributed chain of invocations only the last one may be retried, and (3) pending synchronous invocations with a failed caller are automatically cancelled. These guarantees make it possible to productively develop fault-tolerant distributed applications ranging from classic problems of concurrency theory to complex enterprise applications. Built as a service mesh, our runtime system can interface with application components written in any programming language and scale with the application. We measure overhead relative to reliable message queues. Using an application inspired by a typical enterprise scenario, we assess fault tolerance and the impact of fault recovery on application performance.
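Guarantee (1) can be illustrated with a minimal sketch. It assumes an in-memory queue standing in for a reliable message queue, and a request that is acknowledged (removed) only after the invocation succeeds. The names `FlakyCounter` and `run_until_delivered` are hypothetical and are not part of the system described here; the sketch does not model guarantees (2) or (3).

```python
from collections import deque


class FlakyCounter:
    """Hypothetical actor whose method fails a fixed number of times before succeeding."""

    def __init__(self, failures_before_success):
        self.remaining_failures = failures_before_success
        self.value = 0

    def increment(self, amount):
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise RuntimeError("simulated failure")
        self.value += amount
        return self.value


def run_until_delivered(queue, actors):
    """Process requests in order; a request leaves the queue only after it succeeds."""
    attempts = 0
    while queue:
        actor_id, method, args = queue[0]  # peek: do not remove the request yet
        attempts += 1
        try:
            getattr(actors[actor_id], method)(*args)
        except RuntimeError:
            continue  # invocation failed: the request stays queued and is retried
        queue.popleft()  # acknowledge only after a successful invocation
    return attempts


# Assumed setup: one actor that fails twice, then two queued invocations.
actors = {"counter": FlakyCounter(failures_before_success=2)}
queue = deque([("counter", "increment", (5,)), ("counter", "increment", (1,))])
attempts = run_until_delivered(queue, actors)
```

In this sketch the failing method raises before it mutates state, so a retry never applies a partial update. A real runtime would also have to handle failures that occur after side effects have taken place.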