Applications that fuse machine learning and simulation can benefit from using multiple computing resources: for example, simulation codes can run on highly parallel supercomputers while AI training and inference tasks run on specialized accelerators. Here, we present our experiences deploying two AI-guided simulation workflows across such heterogeneous systems. A unique aspect of our approach is the use of cloud-hosted services to handle challenging aspects of cross-resource authentication and authorization, function-as-a-service (FaaS) invocation, and data transfer. We show that these methods can achieve performance parity with systems that rely on direct connections between resources. We achieve parity by integrating the FaaS system and data transfer capabilities with a system that passes data by reference among managers and workers, and with a user-configurable steering algorithm that hides data transfer latencies. We anticipate that the resulting ease of use can enable routine use of heterogeneous resources in computational science.
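To make the idea of passing data by reference among managers and workers concrete, the following is a minimal, purely illustrative sketch and not the implementation used in this work. The names `ObjectStore`, `Reference`, and `worker_task` are hypothetical, and an in-memory dictionary stands in for a real data transfer service. The manager ships only a small key to the worker, and the worker fetches the payload on first use.

```python
import uuid
from typing import Any, Dict


class ObjectStore:
    """Hypothetical stand-in for a data transfer backend (in-memory here)."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}
        self.fetch_count = 0  # counts how often a payload is actually moved

    def put(self, obj: Any) -> "Reference":
        """Store the object and hand back a lightweight reference to it."""
        key = uuid.uuid4().hex
        self._data[key] = obj
        return Reference(key, self)

    def get(self, key: str) -> Any:
        self.fetch_count += 1
        return self._data[key]


class Reference:
    """Small handle that travels with the task instead of the data itself."""

    def __init__(self, key: str, store: ObjectStore) -> None:
        self.key = key
        self.store = store
        self._resolved = False
        self._value: Any = None

    def resolve(self) -> Any:
        """Fetch the payload on first access and cache it afterwards."""
        if not self._resolved:
            self._value = self.store.get(self.key)
            self._resolved = True
        return self._value


def worker_task(ref: Reference) -> int:
    """Hypothetical worker function: resolves its input only when needed."""
    data = ref.resolve()
    return sum(data)


if __name__ == "__main__":
    store = ObjectStore()
    ref = store.put(list(range(10)))  # manager side: store once, pass the key
    print(worker_task(ref))           # worker side: resolve, then compute
    print(store.fetch_count)
```

In a sketch like this, a steering policy could call `resolve()` ahead of time on the destination resource so that the transfer overlaps with other computation, which is one plausible way to hide transfer latency.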