Many scientific workflows require dedicated compute resources, such as HPC clusters with optimized software, quantum resources, and dedicated hardware cluster systems such as Ray. At the same time, many scientific workflows today are built on Kubernetes, leveraging growing support for workflow and supporting tools. To address the growing demand to support workflows on both cloud and dedicated compute resources, we present the Bridge Operator, a software extension for container orchestration in Kubernetes that facilitates the submission and monitoring of long-running processes on external systems that have their own cluster resource manager (SLURM, LSF, quantum services, and Ray). The Bridge Operator consists of a custom Kubernetes controller that employs a Kubernetes Custom Resource Definition to manage applications. We present controller logic to manage the cloud container orchestration and the interface to the external resource workload manager, a resource definition to submit HTTP/HTTPS requests to the external resource, and a controller pod that communicates with the external resource manager to submit and manage job execution. The implementation mirrors the external resource in Kubernetes pods, which the operator then uses as proxies to control the external system. The implementation is agnostic to the choice of resource manager but assumes that the system exposes an HTTP/HTTPS API for control and management. The Bridge Operator automates the role of a human operator running jobs on a black-box external resource as part of a complex hybrid workflow on the cloud.
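As a rough illustration of the submit-and-monitor pattern described above, the following Python sketch shows a proxy that submits a job to an external resource manager and polls its status until it reaches a terminal state. It is not the Bridge Operator's implementation: `FakeResourceManager`, `bridge_job`, the job specification fields, and the state names are all hypothetical, and the HTTP/HTTPS API is replaced by an in-memory stand-in so that the example runs on its own.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical terminal states; real resource managers (SLURM, LSF, Ray)
# each use their own state vocabulary.
TERMINAL_STATES = {"DONE", "FAILED", "KILLED"}


@dataclass
class FakeResourceManager:
    """In-memory stand-in for an external HTTP/HTTPS job API (assumption)."""

    scripted_states: List[str]  # states reported on successive status calls
    jobs: Dict[str, int] = field(default_factory=dict)

    def submit(self, spec: dict) -> str:
        # A real client would send the job spec to the external API here.
        job_id = f"job-{len(self.jobs) + 1}"
        self.jobs[job_id] = 0
        return job_id

    def status(self, job_id: str) -> str:
        # A real client would query the external API for the job state here.
        index = self.jobs[job_id]
        self.jobs[job_id] = min(index + 1, len(self.scripted_states) - 1)
        return self.scripted_states[index]


def bridge_job(
    manager: FakeResourceManager,
    spec: dict,
    poll_interval: float = 1.0,
    max_polls: int = 100,
) -> Tuple[str, List[str]]:
    """Submit a job, then poll until it reaches a terminal state.

    Returns the job id and the list of observed states, which plays the
    role of the status mirrored on the Kubernetes side.
    """
    job_id = manager.submit(spec)
    history: List[str] = []
    for _ in range(max_polls):
        state = manager.status(job_id)
        history.append(state)
        if state in TERMINAL_STATES:
            return job_id, history
        time.sleep(poll_interval)
    raise TimeoutError(f"{job_id} did not finish within {max_polls} polls")


if __name__ == "__main__":
    rm = FakeResourceManager(["PENDING", "RUNNING", "DONE"])
    print(bridge_job(rm, {"script": "run.sh"}, poll_interval=0))
```

In this sketch the proxy only needs `submit` and `status` calls from the external system, which reflects the stated assumption that the resource manager is treated as a black box behind an API.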