Cloud platforms run many software agents on each server node. These agents manage all aspects of node operation, and in some cases frequently collect data and make decisions. Unfortunately, their behavior is typically based on pre-defined static heuristics or offline analysis; they do not leverage on-node machine learning (ML). In this paper, we first characterize the spectrum of node agents in Azure, and identify the classes of agents that are most likely to benefit from on-node ML. We then propose SOL, an extensible framework for designing ML-based agents that are safe and robust to the range of failure conditions that occur in production. SOL provides a simple API to agent developers and manages the scheduling and running of the agent-specific functions they write. We illustrate the use of SOL by implementing three ML-based agents that manage CPU cores, node power, and memory placement. Our experiments show that (1) ML substantially improves our agents, and (2) SOL ensures that agents operate safely under a variety of failure conditions. We conclude that ML-based agents show significant potential and that SOL can help build them.