Spot instances offer a cost-effective option for applications running in the cloud. However, running long jobs on spot instances is challenging because the instances are subject to unpredictable evictions. Here, we present Spot-on, a generic software framework that supports fault-tolerant, long-running workloads on spot instances through checkpoint and restart. Spot-on leverages existing checkpointing packages and is compatible with the major cloud vendors. Using a genomics application as a test case, we demonstrate that Spot-on supports both application-specific and transparent checkpointing methods. Compared with running applications on on-demand instances, Spot-on allows these workloads to complete at a significantly reduced computing cost. Compared with applications that use application-specific checkpoint mechanisms, applications protected by transparent checkpointing reduce runtime by up to 40%, leading to further cost savings of up to 86%.
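To make the checkpoint-and-restart idea concrete, the following is a minimal sketch of application-specific checkpointing in Python. It is not Spot-on's implementation. The names `run_job`, `save_checkpoint`, `load_checkpoint`, and `Evicted`, the JSON state layout, and the simulated eviction are all assumptions introduced for illustration; a real deployment would react to the cloud vendor's eviction notice and store checkpoints on storage that outlives the instance.

```python
import json
import os
import tempfile


class Evicted(Exception):
    """Hypothetical stand-in for a spot-instance eviction."""


def save_checkpoint(path, state):
    """Write state to a temp file, then rename, so a crash mid-write
    cannot leave a half-written checkpoint at `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic rename on the same filesystem


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_step": 0, "total": 0}


def run_job(path, n_steps, checkpoint_every=10, evict_at=None):
    """Toy long-running job: sum the step indices, checkpointing periodically.

    `evict_at` (illustrative only) simulates an eviction before that step.
    """
    state = load_checkpoint(path)
    for step in range(state["next_step"], n_steps):
        if evict_at is not None and step == evict_at:
            raise Evicted(step)
        state["total"] += step
        state["next_step"] = step + 1
        if state["next_step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

After an eviction, calling `run_job` again with the same checkpoint path resumes from the last saved step rather than from the beginning; only the work done since the last checkpoint is repeated. Transparent checkpointing differs in that the process state is captured from outside the application, so the application code needs no save and load logic of its own.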