Online reinforcement learning (RL) algorithms are often difficult to deploy in complex human-facing applications because they may learn slowly and have poor early performance. To address this, we introduce a practical algorithm for incorporating human insight to speed up learning. Our algorithm, Constraint Sampling Reinforcement Learning (CSRL), incorporates prior domain knowledge as constraints (restrictions) on the RL policy. It takes in multiple potential policy constraints to maintain robustness to misspecification of individual constraints while leveraging helpful ones to learn quickly. Given a base RL learning algorithm (e.g., UCRL, DQN, Rainbow), we propose an upper-confidence-with-elimination scheme that leverages the relationship between the constraints and their observed performance to adaptively switch among them. We instantiate CSRL with DQN-type algorithms and UCRL as base algorithms, and evaluate it in four environments, including three simulators based on real data: recommendations, educational activity sequencing, and HIV treatment sequencing. In all cases, CSRL learns a good policy faster than baselines.
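The abstract only sketches the upper-confidence-with-elimination scheme at a high level. The following is a minimal illustrative Python sketch of how such a wrapper around a base RL agent might look: each episode, the constraint with the highest upper confidence bound on return is selected and used to restrict the base agent's actions, and constraints whose upper bound falls below the best lower bound are eliminated. The class and interface names (`base_agent.act`/`update`, constraints as state-to-allowed-action maps) and the exact bonus term are assumptions for illustration, not the paper's implementation.

```python
import math


class ConstraintSamplingSketch:
    """Illustrative sketch of a CSRL-style wrapper (not the authors' code).

    Assumes: `constraints` is a list of callables mapping a state to the set
    of allowed actions, and `base_agent` is any RL learner exposing
    act(state, allowed_actions=...) and update(s, a, r, s', done).
    """

    def __init__(self, base_agent, constraints, bonus_scale=1.0):
        self.base_agent = base_agent
        self.constraints = list(constraints)
        self.active = list(range(len(self.constraints)))  # not yet eliminated
        self.counts = [0] * len(self.constraints)
        self.mean_return = [0.0] * len(self.constraints)
        self.bonus_scale = bonus_scale  # assumed exploration-bonus scaling

    def _bonus(self, i, t):
        if self.counts[i] == 0:
            return float("inf")
        return self.bonus_scale * math.sqrt(math.log(max(t, 2)) / self.counts[i])

    def select_constraint(self, t):
        # Optimism: pick the active constraint with the highest upper bound.
        return max(self.active, key=lambda i: self.mean_return[i] + self._bonus(i, t))

    def record_return(self, i, episode_return, t):
        # Update the running mean return observed under constraint i.
        self.counts[i] += 1
        self.mean_return[i] += (episode_return - self.mean_return[i]) / self.counts[i]
        # Elimination: drop constraints whose upper bound falls below the
        # best lower bound among the still-active constraints.
        best_lcb = max(self.mean_return[j] - self._bonus(j, t) for j in self.active)
        self.active = [
            j for j in self.active
            if self.mean_return[j] + self._bonus(j, t) >= best_lcb
        ]

    def run_episode(self, env, t):
        i = self.select_constraint(t)
        allowed = self.constraints[i]
        state, done, total = env.reset(), False, 0.0
        while not done:
            action = self.base_agent.act(state, allowed_actions=allowed(state))
            next_state, reward, done, _ = env.step(action)
            self.base_agent.update(state, action, reward, next_state, done)
            state, total = next_state, total + reward
        self.record_return(i, total, t)
        return total
```

In this sketch the base learner is unchanged; the wrapper only decides which candidate constraint restricts its action set each episode, which is why a misspecified constraint can be eliminated once its observed returns are confidently worse than another constraint's.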