We consider the joint design and control of discrete-time stochastic dynamical systems over a finite time horizon. We formulate the problem as a multi-step optimization problem under uncertainty, seeking to identify a system design and a control policy that jointly maximize the expected sum of rewards collected over the time horizon considered. The transition function, the reward function and the policy are all parametrized, and are assumed to be known and differentiable with respect to their parameters. We then introduce a deep reinforcement learning algorithm, combining policy gradient methods with model-based optimization techniques, to solve this problem. In essence, our algorithm iteratively approximates the gradient of the expected return via Monte-Carlo sampling and automatic differentiation, and takes projected gradient ascent steps in the joint space of environment and policy parameters. We refer to this algorithm as Direct Environment and Policy Search (DEPS). We assess its performance in three environments concerned with the design and control of a mass-spring-damper system, a small-scale off-grid power system and a drone, respectively. In addition, DEPS is benchmarked against a state-of-the-art deep reinforcement learning algorithm used to tackle joint design and control problems: DEPS performs at least as well as this baseline in all three environments, and consistently yields solutions with higher returns in fewer iterations. Finally, solutions produced by DEPS are compared with those produced by an algorithm that does not jointly optimize environment and policy parameters, highlighting that higher returns are achieved when the joint optimization is performed.
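The loop described above, Monte-Carlo estimation of the expected return followed by projected gradient ascent over the concatenated environment and policy parameters, can be sketched as follows. This is our own toy illustration, not the paper's implementation: it uses a simple noisy mass-spring-damper with a linear feedback policy, and substitutes central finite differences with common random numbers for automatic differentiation so that the sketch stays self-contained. All parameter names and bounds here are illustrative assumptions.

```python
import numpy as np

def rollout(env_params, pol_params, rng):
    """Simulate a noisy mass-spring-damper for 40 steps; return the total reward."""
    k, c = env_params            # design variables: spring and damping constants
    th1, th2 = pol_params        # linear state-feedback gains
    x, v, dt, total = 1.0, 0.0, 0.05, 0.0
    for _ in range(40):
        u = -th1 * x - th2 * v                       # control action
        a = -k * x - c * v + u                       # unit-mass dynamics
        x += dt * v
        v += dt * a + 0.01 * rng.standard_normal()   # process noise
        total += -(x * x + 0.1 * u * u)              # quadratic cost as negative reward
    return total

def expected_return(params, seed=0, n=32):
    """Monte-Carlo estimate with common random numbers (fixed seed)."""
    rng = np.random.default_rng(seed)
    return np.mean([rollout(params[:2], params[2:], rng) for _ in range(n)])

def grad_fd(f, p, eps=1e-4):
    """Central finite differences, standing in for automatic differentiation."""
    g = np.zeros_like(p)
    for i in range(len(p)):
        d = np.zeros_like(p)
        d[i] = eps
        g[i] = (f(p + d) - f(p - d)) / (2 * eps)
    return g

# Joint parameter vector [k, c, theta1, theta2], with box constraints enforced
# by projection (np.clip) after each ascent step.
lo = np.array([0.1, 0.01, -10.0, -10.0])
hi = np.array([10.0, 5.0, 10.0, 10.0])
p = np.array([1.0, 0.1, 0.0, 0.0])
for _ in range(150):
    g = grad_fd(expected_return, p)
    p = np.clip(p + 0.01 * g, lo, hi)   # projected gradient ascent step
```

Because the design variables (`k`, `c`) and the policy gains (`th1`, `th2`) sit in one vector, a single gradient step trades them off jointly, which is the point of the joint formulation: optimizing the gains alone for a fixed spring would miss designs that make the system cheaper to control.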