Wall Street Tree Search: Risk-Aware Planning for Offline Reinforcement Learning

Offline reinforcement learning (RL) algorithms learn to make decisions from a given, fixed training dataset, without online data collection. This setting is appealing because it promises to exploit previously collected datasets without any costly or risky interaction with the environment. The same restriction is also its drawback: a fixed dataset induces uncertainty, because the agent can encounter unfamiliar sequences of states and actions that the training data did not cover. To mitigate the destructive effects of this uncertainty, we need to balance the aspiration to take reward-maximizing actions against the risk incurred by incorrect ones. In financial economics, modern portfolio theory (MPT) is a method that risk-averse investors can use to construct diversified portfolios that maximize their returns without unacceptable levels of risk. We propose integrating MPT into the agent's decision-making process, presenting a new simple-yet-highly-effective risk-aware planning algorithm for offline RL. Our algorithm allows us to systematically account for the \emph{estimated quality} of specific actions and their \emph{estimated risk} due to the uncertainty. We show that our approach can be coupled with the Transformer architecture to yield a state-of-the-art planner, which maximizes the return for offline RL tasks. Moreover, our algorithm significantly reduces the variance of the results compared to conventional Transformer decoding, yielding a much more stable algorithm. This property is essential in the offline RL setting, where real-world exploration and failures can be costly or dangerous.
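
The trade-off between estimated quality and estimated risk can be illustrated with a generic mean-variance rule in the spirit of MPT. The sketch below is not the paper's algorithm, whose exact formulation the abstract does not give. The function name `risk_aware_choice`, the `risk_aversion` parameter, and the idea of obtaining several return estimates per candidate action (for example from repeated rollouts) are all assumptions made for illustration.

```python
import numpy as np


def risk_aware_choice(return_samples, risk_aversion=1.0):
    """Pick the candidate with the highest mean-variance utility.

    return_samples: array-like of shape (n_candidates, n_samples).
    Row i holds several estimates of the return of candidate action i
    (a hypothetical input, not something the abstract specifies).
    risk_aversion: assumed trade-off weight; 0 ignores risk entirely.
    """
    samples = np.asarray(return_samples, dtype=float)
    expected = samples.mean(axis=1)   # estimated quality of each candidate
    risk = samples.var(axis=1)        # estimated risk (population variance)
    utility = expected - risk_aversion * risk
    return int(np.argmax(utility)), utility


# Illustrative numbers only: candidate 1 has a slightly higher mean return
# but far more spread across its estimates than candidate 0.
candidates = np.array([
    [9.0, 10.0, 11.0, 10.0],
    [20.0, 0.0, 22.0, 2.0],
])

greedy_index, _ = risk_aware_choice(candidates, risk_aversion=0.0)
cautious_index, _ = risk_aware_choice(candidates, risk_aversion=0.1)
```

With `risk_aversion=0.0` the rule reduces to picking the highest mean, so it selects the high-variance candidate. With a small positive weight the variance penalty dominates and the steadier candidate is chosen instead, which is the kind of stability the abstract argues for.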
