Enterprise cloud developers have to build applications that are resilient to failures and interruptions. We advocate for, formalize, implement, and evaluate a simple yet effective fault-tolerant programming model for the cloud based on actors, reliable message delivery, and retry orchestration. Our model simultaneously guarantees that (1) failed actor invocations are retried until success and (2) a strict happens-before relationship is preserved across failures within each distributed chain of invocations and retries. These guarantees make it possible to productively develop fault-tolerant distributed applications leveraging cloud services, ranging from classic problems of concurrency theory to enterprise applications. Built as a service mesh, our runtime can compose application components written in any programming language and scale with the application. We measure overhead relative to reliable message queues. Using an application inspired by a typical enterprise scenario, we assess fault tolerance and the impact of fault recovery on performance.
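To make the two guarantees concrete, the following is a minimal in-process sketch, not the runtime described here. It assumes a single chain of invocations and simulates failures with an exception; the names `run_chain`, `flaky`, and `TransientFailure` are hypothetical and chosen only for illustration. Each invocation is retried until it succeeds, and the next invocation in the chain does not start until its predecessor has completed, so retries never reorder the chain.

```python
from collections import deque


class TransientFailure(Exception):
    """Simulated crash or interruption of an invocation (hypothetical)."""


def run_chain(invocations):
    """Run (name, callable) pairs in order, retrying each until it succeeds.

    The head of the queue is only removed after a successful call, so a
    later invocation can never run before an earlier one has completed.
    """
    log = []
    pending = deque(invocations)
    while pending:
        name, call = pending[0]
        try:
            call()
        except TransientFailure:
            log.append((name, "retry"))
            continue  # retry the same invocation; do not advance the chain
        log.append((name, "ok"))
        pending.popleft()
    return log


def flaky(failures):
    """Return a callable that fails `failures` times and then succeeds."""
    remaining = [failures]

    def call():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise TransientFailure()
        return "done"

    return call


log = run_chain([
    ("debit", flaky(2)),   # fails twice, then succeeds
    ("credit", flaky(0)),  # succeeds on the first attempt
    ("notify", flaky(1)),  # fails once, then succeeds
])
for entry in log:
    print(entry)
```

In this sketch the successful completions always appear in chain order (`debit`, `credit`, `notify`), regardless of how many retries each step needs. A real deployment would keep the pending invocations in a reliable message queue so that they survive process failures; that persistence is deliberately left out here.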