Offline reinforcement-learning (RL) algorithms learn to make decisions from a given, fixed training dataset, without the possibility of additional online data collection. This problem setting is captivating because it holds the promise of exploiting previously collected datasets without any costly or risky interaction with the environment. However, this promise is also the source of the setting's main drawback. The restricted dataset induces subjective uncertainty because the agent can encounter unfamiliar sequences of states and actions that the training data did not cover. Moreover, inherent system stochasticity further increases uncertainty and aggravates the offline RL problem, preventing the agent from learning an optimal policy. To mitigate the harmful effects of this uncertainty, we need to balance the aspiration to take reward-maximizing actions against the risk incurred by incorrect ones. In financial economics, modern portfolio theory (MPT) is a method that risk-averse investors can use to construct diversified portfolios that maximize their returns without unacceptable levels of risk. We integrate MPT into the agent's decision-making process to present a simple yet highly effective risk-aware planning algorithm for offline RL. Our algorithm allows us to systematically account for the \emph{estimated quality} of specific actions and their \emph{estimated risk} due to the uncertainty. We show that our approach can be coupled with the Transformer architecture to yield a state-of-the-art planner for offline RL tasks, maximizing the return while significantly reducing the variance.
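As a minimal sketch of the underlying idea (the abstract does not specify implementation details), an MPT-style mean-variance criterion can score each candidate action sequence by its estimated quality penalized by its estimated risk. The function name, the \texttt{risk\_aversion} coefficient, and the use of sampled or ensemble return estimates below are illustrative assumptions, not the paper's exact procedure.
\begin{verbatim}
import numpy as np

def mean_variance_action_score(return_samples, risk_aversion):
    # return_samples: (num_actions, num_samples) array of sampled or
    # ensemble return estimates for each candidate action sequence.
    # risk_aversion: hypothetical lambda >= 0 trading return against risk.
    mean_return = return_samples.mean(axis=1)      # estimated quality
    return_variance = return_samples.var(axis=1)   # estimated risk
    return mean_return - risk_aversion * return_variance

# Toy usage: three candidate action sequences, five return samples each.
rng = np.random.default_rng(0)
samples = rng.normal(loc=[[1.0], [1.2], [0.9]],
                     scale=[[0.1], [0.8], [0.05]], size=(3, 5))
scores = mean_variance_action_score(samples, risk_aversion=1.0)
best = int(np.argmax(scores))   # risk-adjusted greedy choice
\end{verbatim}
Under this criterion, a high-variance candidate (such as the second sequence above) can lose to a slightly lower-return but more certain one, which is the return-versus-risk trade-off the abstract describes.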