AI workloads, often hosted in multi-tenant cloud environments, require vast computational resources but suffer from inefficiencies caused by limited coordination between tenants and providers. Tenants lack insight into the underlying infrastructure, while providers lack the workload details needed to optimize tasks such as partitioning, scheduling, and fault tolerance. We propose the HarmonAIze project to redefine cloud abstractions, enabling cooperative optimization for improved performance, efficiency, resiliency, and sustainability. This paper outlines key opportunities, challenges, and a research agenda to realize this vision.