Open source cloud technologies offer broad support for building customized clusters of compute nodes that schedule tasks and manage resources. In cloud infrastructures used for scientific research, such as Jetstream and Chameleon, users receive complete control of the virtual machines (VMs) allocated to them, including root access. This gives HPC users an opportunity to experiment with new resource management technologies such as Apache Mesos that have proven scalability, flexibility, and fault tolerance. Containerization technology, which eases the development and deployment of HPC tools on the cloud, has matured and is gaining interest in the scientific community. In particular, several well-known scientific code bases now have publicly available Docker containers. While Mesos supports executing Docker containers individually, it does not support inter-container communication or the orchestration of containers for a parallel or distributed application. In this paper, we present the design, implementation, and performance analysis of Scylla, a Mesos framework that integrates Mesos with Docker Swarm to enable orchestration of MPI jobs on a cluster of VMs acquired from the Chameleon cloud [1]. Scylla uses Docker Swarm for communication between containerized tasks (MPI processes) and Apache Mesos for resource pooling and allocation. Scylla allows a policy-driven approach to determine how containers are distributed across the nodes, depending on the CPU, memory, and network throughput requirements of each application.
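To make the idea of policy-driven placement concrete, the following is a minimal Python sketch and not Scylla's actual implementation. It assumes that each resource offer reduces to a node name with available CPUs and memory, and it uses two hypothetical policy names: `pack`, which fills the largest nodes first so that communication-heavy MPI jobs span few nodes, and `spread`, which distributes processes round-robin across nodes. The `Offer` class, the `place` function, and both policy names are illustrative assumptions; the sketch does not call the Mesos or Docker Swarm APIs.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    """Hypothetical, simplified view of the resources offered by one node."""
    node: str
    cpus: float
    mem_mb: int


def capacity(offer, cpus_per_task, mem_per_task):
    """Return how many tasks of the given size fit in one offer."""
    return int(min(offer.cpus // cpus_per_task, offer.mem_mb // mem_per_task))


def place(offers, n_tasks, cpus_per_task, mem_per_task, policy):
    """Map node name -> number of MPI processes under an assumed policy."""
    slots = {o.node: capacity(o, cpus_per_task, mem_per_task) for o in offers}
    if sum(slots.values()) < n_tasks:
        raise ValueError("offers cannot hold all tasks")

    placement = {node: 0 for node in slots}
    remaining = n_tasks
    if policy == "pack":
        # Fill the nodes with the most free slots first to use few nodes.
        for node in sorted(slots, key=slots.get, reverse=True):
            take = min(slots[node], remaining)
            placement[node] = take
            remaining -= take
            if remaining == 0:
                break
    elif policy == "spread":
        # Hand out one task per node per round until all tasks are placed.
        while remaining > 0:
            for node in slots:
                if remaining == 0:
                    break
                if placement[node] < slots[node]:
                    placement[node] += 1
                    remaining -= 1
    else:
        raise ValueError(f"unknown policy: {policy}")

    # Report only the nodes that actually received tasks.
    return {node: count for node, count in placement.items() if count > 0}


if __name__ == "__main__":
    offers = [Offer("a", 4, 8192), Offer("b", 8, 16384), Offer("c", 2, 4096)]
    print(place(offers, 6, 1, 1024, "pack"))
    print(place(offers, 6, 1, 1024, "spread"))
```

In this sketch, a network-bound job would favor `pack` to keep processes on as few nodes as possible, while a CPU-bound or memory-bound job would favor `spread` to reduce contention on any single node. A real framework would also need to weigh measured network throughput, which is omitted here.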