Reinforcement learning (RL) has shown promise for decision-making tasks in real-world applications. One practical framework involves training parameterized policy models on an offline dataset and subsequently deploying them in an online environment. However, this approach can be risky: offline training may be imperfect, and the resulting RL models may perform poorly or take dangerous actions. To address this issue, we propose an alternative framework in which a human supervises the RL models and provides additional feedback during the online deployment phase. We formalize this online deployment problem and develop two approaches. The first uses model selection with the upper confidence bound algorithm to adaptively choose a model to deploy from a candidate set of offline-trained RL models. The second fine-tunes the model during online deployment as supervision signals arrive. We empirically validate the effectiveness of both approaches on robot locomotion control and traffic light control tasks.
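To make the first approach concrete, below is a minimal sketch of UCB-style model selection over a set of candidate policies, assuming episodic scalar returns as the online feedback. The `run_episode` hook and the UCB1 form of the exploration bonus are illustrative assumptions, not details taken from the paper.

```python
import math

def ucb_model_selection(models, run_episode, num_episodes, c=2.0):
    """Adaptively choose which candidate policy to deploy using UCB1.

    models:      list of candidate offline-trained policies
    run_episode: callable(policy) -> scalar episode return (hypothetical
                 hook standing in for one online deployment episode)
    c:           exploration coefficient
    """
    counts = [0] * len(models)   # deployments per candidate
    means = [0.0] * len(models)  # running mean return per candidate

    for t in range(1, num_episodes + 1):
        if t <= len(models):
            # Warm start: deploy each candidate once before using the bound.
            i = t - 1
        else:
            # Pick the candidate with the highest upper confidence bound.
            i = max(
                range(len(models)),
                key=lambda k: means[k] + c * math.sqrt(math.log(t) / counts[k]),
            )
        reward = run_episode(models[i])
        counts[i] += 1
        means[i] += (reward - means[i]) / counts[i]  # incremental mean update

    # Index of the candidate with the best empirical return.
    return max(range(len(models)), key=lambda k: means[k])
```

The warm-start loop guarantees every candidate is deployed at least once before the confidence bound is computed, so the bonus term is always well defined.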