Liquid cooling is critical for thermal management in high-density data centers as AI workloads grow. Machine learning-based controllers are essential to unlock further gains in energy efficiency and reliability, and thereby to improve sustainability. We present LC-Opt, a Sustainable Liquid Cooling (LC) benchmark environment for reinforcement learning (RL) control strategies for energy-efficient liquid cooling of high-performance computing (HPC) systems. Built on a high-fidelity digital twin of the cooling system of Oak Ridge National Laboratory's Frontier supercomputer, LC-Opt provides detailed Modelica-based end-to-end models spanning site-level cooling towers, data center cabinets, and server blade groups. Through a Gymnasium interface, RL agents optimize critical thermal controls under dynamically changing workloads: liquid supply temperature, flow rate, and granular valve actuation at the IT cabinet level, as well as cooling tower (CT) setpoints. The environment poses a multi-objective, real-time optimization challenge that balances local thermal regulation against global energy efficiency, and it supports additional components such as a heat recovery unit (HRU). We benchmark centralized and decentralized multi-agent RL approaches, demonstrate policy distillation into decision and regression trees for interpretable control, and explore LLM-based methods that explain control actions in natural language through an agentic mesh architecture designed to foster user trust and simplify system management. LC-Opt democratizes access to detailed, customizable liquid cooling models, enabling the ML community, operators, and vendors to develop sustainable data center liquid cooling control solutions.
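To make the trade-off between local thermal regulation and global energy efficiency concrete, the following is a minimal sketch of a toy single-cabinet environment. It is not the LC-Opt API and does not use the Modelica models: the class `ToyCabinetCoolingEnv`, the first-order thermal update, the weighted-sum reward, and all numeric constants are hypothetical and chosen only for illustration. The sketch mirrors the return shapes of Gymnasium's `reset` and `step` without importing the library; a real Gymnasium environment would subclass `gymnasium.Env` and declare observation and action spaces.

```python
import random


class ToyCabinetCoolingEnv:
    """Hypothetical one-cabinet stand-in that mimics Gymnasium's reset/step
    return shapes. It is NOT the LC-Opt API and not a Modelica model."""

    def __init__(self, target_temp_c=45.0, energy_weight=5.0, horizon=50, seed=0):
        self.target_temp_c = target_temp_c  # assumed cabinet temperature target
        self.energy_weight = energy_weight  # assumed trade-off weight
        self.horizon = horizon              # steps before the episode is truncated
        self.seed = seed

    def reset(self):
        self._rng = random.Random(self.seed)
        self._t = 0
        self.temp_c = 40.0   # assumed initial cabinet temperature
        self.load_kw = 20.0  # assumed initial IT load
        return (self.temp_c, self.load_kw), {}

    def step(self, action):
        supply_temp_c, flow = action
        flow = min(max(flow, 0.0), 1.0)  # normalized flow rate in [0, 1]
        # Workload follows a bounded random walk (assumed dynamics).
        self.load_kw = min(max(self.load_kw + self._rng.uniform(-2.0, 2.0), 5.0), 40.0)
        # First-order thermal update: IT load heats, coolant flow removes heat.
        self.temp_c += 0.05 * self.load_kw - 0.5 * flow * (self.temp_c - supply_temp_c)
        # Local objective: stay close to the target temperature.
        thermal_penalty = abs(self.temp_c - self.target_temp_c)
        # Global objective: pump power proxy, cubic in flow (pump affinity laws).
        energy_penalty = flow ** 3
        reward = -(thermal_penalty + self.energy_weight * energy_penalty)
        self._t += 1
        truncated = self._t >= self.horizon
        return (self.temp_c, self.load_kw), reward, False, truncated, {}


def proportional_policy(obs, target_temp_c=45.0, gain=0.2):
    """Baseline controller: raise flow in proportion to the temperature overshoot."""
    temp_c, _load_kw = obs
    flow = min(max(gain * (temp_c - target_temp_c) + 0.3, 0.0), 1.0)
    return (30.0, flow)  # fixed 30 C supply temperature (assumed)


def rollout(env, policy):
    """Run one episode and return the per-step rewards."""
    obs, _info = env.reset()
    rewards = []
    truncated = False
    while not truncated:
        obs, reward, _terminated, truncated, _info = env.step(policy(obs))
        rewards.append(reward)
    return rewards
```

Calling `rollout(ToyCabinetCoolingEnv(), proportional_policy)` returns one reward per step. Raising `energy_weight` makes the same policy score worse whenever it runs the pump hard, which is the kind of tension between thermal regulation and energy use that an RL agent would have to resolve.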