Science Data Systems (SDS) handle science data from acquisition through processing to distribution. Today they are deployed in the Cloud, where efficient use of Cloud instances is critical to success. Conventional SDS cannot take advantage of the cost-effective Amazon EC2 Spot market, especially for long-running tasks. Difficulties found in current practice at NASA/JPL include: the lack of a mechanism for application programmers to save valuable partial results so that processing can be continued later; the heavy weight of container-based (Singularity) sandboxes, which contain more than 200,000 OS-level files; and the gap between scientists who develop algorithms and programs on a laptop and the SDS experts who deploy software in Cloud computing or supercomputing environments. We present a first proof-of-principle that addresses these difficulties by using NavP (Navigational Programming) and fault-tolerant computing (FTC) in SDS, employing program state migration facilitated by Checkpoint-Restart (C/R). NavP gives application programmers a new navigational view of computations in a distributed world. The tool we developed, DHP (DMTCP Hop and Publish), lets application programmers navigate a computation among instances or nodes by inserting hop(destination) statements in their application code, and lets them choose when to publish partial results, at stages of their algorithms that they consider worthwhile for future continuation. With DHP, a parallel distributed SDS becomes easier to program and deploy, which in turn enables more efficient use of the Amazon EC2 Spot market. This technical report describes a high-level design and an initial implementation.
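To make the hop-and-publish idea concrete, the following is a minimal toy sketch in Python. It is not the DHP or DMTCP API: the class `ToyNavigator`, its `publish` and `load` methods, the stage name `calibrated`, and the node names are all hypothetical, and `hop` here only records a change of location instead of checkpointing and restarting a process. The sketch assumes that a published partial result is simply a pickled object in a shared directory, and shows how a later run could continue from it after an interruption.

```python
import pickle
import tempfile
from pathlib import Path


class ToyNavigator:
    """Toy stand-in for hop/publish semantics; NOT the real DHP or DMTCP API."""

    def __init__(self, store: Path, start: str = "laptop"):
        self.store = store          # directory holding published partial results
        self.location = start       # where the computation currently "runs"
        self.trace = [start]        # history of visited locations

    def hop(self, destination: str) -> None:
        # A real system would checkpoint the process and restart it on
        # `destination`; this toy version only records the move.
        self.location = destination
        self.trace.append(destination)

    def publish(self, stage: str, partial) -> Path:
        # Save a partial result so a later run can continue from this stage.
        path = self.store / f"{stage}.pkl"
        with path.open("wb") as f:
            pickle.dump({"location": self.location, "partial": partial}, f)
        return path

    def load(self, stage: str):
        # Return a previously published partial result, or None if absent.
        path = self.store / f"{stage}.pkl"
        if not path.exists():
            return None
        with path.open("rb") as f:
            return pickle.load(f)["partial"]


def process(nav: ToyNavigator, samples):
    # Stage 1: reuse the published result if an earlier run got this far.
    calibrated = nav.load("calibrated")
    if calibrated is None:
        nav.hop("spot-instance-a")
        calibrated = [s * 2 for s in samples]
        nav.publish("calibrated", calibrated)
    # Stage 2: continue on a different (hypothetical) instance.
    nav.hop("spot-instance-b")
    return sum(calibrated)


if __name__ == "__main__":
    store = Path(tempfile.mkdtemp())
    first = ToyNavigator(store)
    print(process(first, [1, 2, 3]), first.trace)
    # A second run finds the published stage and skips the first hop.
    second = ToyNavigator(store)
    print(process(second, [1, 2, 3]), second.trace)
```

In this sketch the programmer, not the framework, decides which stage is worth publishing, which mirrors the choice the abstract leaves to application programmers.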