Planning provides a framework for optimizing sequential decisions in complex environments. Recent advances in efficient planning in deterministic or stochastic high-dimensional domains with continuous action spaces leverage backpropagation through a model of the environment to directly optimize actions. However, existing methods typically do not take risk into account when optimizing in stochastic domains, even though risk can be incorporated efficiently in MDPs by optimizing the entropic utility of returns. We bridge this gap by introducing Risk-Aware Planning using PyTorch (RAPTOR), a novel framework for risk-sensitive planning through end-to-end optimization of the entropic utility objective. A key technical difficulty of our approach is that direct optimization of the entropic utility by backpropagation is impossible due to the presence of environment stochasticity. The novelty of RAPTOR lies in the reparameterization of the state distribution, which makes it possible to apply stochastic backpropagation through sufficient statistics of the entropic utility computed from forward-sampled trajectories. The direct end-to-end optimization of this empirical objective yields what we call the risk-averse straight-line plan, which commits to a sequence of actions in advance and can be sub-optimal in highly stochastic domains. We address this shortcoming by optimizing for risk-aware Deep Reactive Policies (RaDRP) in our framework. We evaluate and compare these two forms of RAPTOR on three highly stochastic domains, including nonlinear navigation, HVAC control, and linear reservoir control, demonstrating the ability to manage risk in complex MDPs.
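The core idea described above can be illustrated with a minimal sketch. The example below is an assumption-laden toy, not the paper's implementation: it uses a hypothetical one-dimensional linear-Gaussian transition `s' = s + a + sigma * eps` and a quadratic cost, estimates the entropic utility `U_beta(R) = (1/beta) * log E[exp(beta * R)]` (with `beta < 0` for risk aversion) from forward-sampled trajectories via `logsumexp`, and optimizes an open-loop action sequence, i.e. a straight-line plan, by backpropagating through the reparameterized noise.

```python
import torch

def entropic_utility(returns, beta):
    # U_beta(R) = (1/beta) * log E[exp(beta * R)]; beta < 0 is risk-averse.
    # Computed stably from N sampled trajectory returns via logsumexp.
    n = torch.tensor(float(returns.shape[0]))
    return (torch.logsumexp(beta * returns, dim=0) - torch.log(n)) / beta

def rollout_returns(actions, s0, sigma, n_samples, horizon):
    # Reparameterized stochastic transitions: noise is sampled outside the
    # computation graph, so the rollout stays differentiable w.r.t. actions.
    s = s0.expand(n_samples).clone()
    total = torch.zeros(n_samples)
    for t in range(horizon):
        eps = torch.randn(n_samples)       # exogenous noise sample
        s = s + actions[t] + sigma * eps   # hypothetical toy dynamics
        total = total - s ** 2             # reward: stay near the origin
    return total

torch.manual_seed(0)
horizon, beta = 5, -2.0
actions = torch.zeros(horizon, requires_grad=True)  # straight-line plan
opt = torch.optim.Adam([actions], lr=0.1)
for step in range(200):
    opt.zero_grad()
    R = rollout_returns(actions, torch.tensor(1.0),
                        sigma=0.1, n_samples=256, horizon=horizon)
    loss = -entropic_utility(R, beta)  # ascend the entropic utility
    loss.backward()
    opt.step()
```

Since the state starts at 1 and the cost penalizes distance from the origin, the optimized plan's first action should move the state close to 0. For `beta < 0`, Jensen's inequality guarantees the entropic utility is at most the mean return, which is why the objective penalizes high-variance plans.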