Many real-world problems contain multiple objectives and agents, where a trade-off exists between objectives. Key to solving such problems is exploiting the sparse dependency structures that exist between agents. For example, in wind farm control a trade-off exists between maximising power and minimising stress on the system's components. Dependencies between turbines arise due to the wake effect. We model such sparse dependencies between agents as a multi-objective coordination graph (MO-CoG). In multi-objective reinforcement learning, a utility function is typically used to model a user's preferences over objectives, which may be unknown a priori. In such settings, a set of optimal policies must be computed. Which policies are optimal depends on which optimality criterion applies. If a user's utility is derived from multiple executions of a policy, the scalarised expected returns (SER) criterion must be optimised. If a user's utility is derived from a single execution of a policy, the expected scalarised returns (ESR) criterion must be optimised. For example, wind farms are subject to constraints and regulations that must be adhered to at all times; therefore the ESR criterion must be optimised. For MO-CoGs, the state-of-the-art algorithms can only compute a set of optimal policies for the SER criterion, leaving the ESR criterion understudied. To compute a set of optimal policies under the ESR criterion, also known as the ESR set, distributions over the returns must be maintained. Therefore, to compute a set of optimal policies under the ESR criterion for MO-CoGs, we present a novel distributional multi-objective variable elimination (DMOVE) algorithm. We evaluate DMOVE in realistic wind farm simulations. Given that the returns in real-world wind farm settings are continuous, we utilise a model known as real-NVP to learn the continuous return distributions needed to calculate the ESR set.
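For readers unfamiliar with the two optimality criteria, the standard formulations from the multi-objective decision-making literature can be written as follows; this is a clarifying sketch with assumed notation (a utility function $u$ and a vector-valued return $\mathbf{R}$ under a policy $\pi$), not an excerpt from the abstract itself:
\begin{align}
  V^{\pi}_{\text{SER}} &= u\!\left(\mathbb{E}\left[\mathbf{R} \mid \pi\right]\right)
    && \text{utility of the expected return (multiple executions)}\\
  V^{\pi}_{\text{ESR}} &= \mathbb{E}\left[\,u\!\left(\mathbf{R}\right) \mid \pi\right]
    && \text{expected utility of the return of a single execution}
\end{align}
Because $u$ is generally nonlinear and, under ESR, is applied inside the expectation, the mean return vector alone is not sufficient; the full distribution over returns must be maintained, which is what motivates the distributional approach taken by DMOVE.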