In many contemporary applications, such as healthcare, finance, robotics, and recommendation systems, continuously deploying new policies for data collection and online learning is either impractical or not cost-effective. We consider a setting that lies between pure offline reinforcement learning (RL) and pure online RL, called deployment-constrained RL, in which the number of policy deployments for data sampling is limited. To solve this challenging task, we propose a new algorithmic learning framework called Model-based Uncertainty regularized and Sample Efficient Batch Optimization (MUSBO). Our framework discovers novel, high-quality samples at each deployment to enable efficient data collection. During each offline training session, we bootstrap the policy update by quantifying the uncertainty in the data collected so far. In the high-support region (low uncertainty), we encourage the policy to take an aggressive update. In the low-support region (high uncertainty), where the policy bootstraps into the out-of-distribution region, we downweight the update by our estimated uncertainty. Experimental results show that MUSBO achieves state-of-the-art performance in the deployment-constrained RL setting.
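The abstract does not specify how the uncertainty is estimated or how the downweighting is applied. As a minimal illustrative sketch, and not the authors' implementation, one common choice in model-based offline RL is to measure disagreement across an ensemble of learned dynamics models and subtract it from the reward. The function names, the linear penalty, and the coefficient `lam` below are all assumptions introduced for illustration.

```python
import numpy as np


def ensemble_uncertainty(predictions: np.ndarray) -> np.ndarray:
    """Per-sample disagreement across ensemble members (assumed estimator).

    predictions has shape (n_models, n_samples, state_dim).
    """
    # Standard deviation over the ensemble axis, then L2 norm over state dims.
    return np.linalg.norm(predictions.std(axis=0), axis=-1)


def penalized_reward(reward: np.ndarray, uncertainty: np.ndarray,
                     lam: float = 1.0) -> np.ndarray:
    """Downweight rewards where uncertainty is high (hypothetical linear form)."""
    return reward - lam * uncertainty


# Toy next-state predictions from three hypothetical models.
# Sample 0 is well supported by data: the models nearly agree.
in_support = np.array([[1.00, 2.00], [1.01, 2.01], [0.99, 1.99]])
# Sample 1 is out of distribution: the models disagree strongly.
out_of_support = np.array([[1.0, 2.0], [2.0, 0.5], [-0.5, 3.5]])

# Shape (n_models=3, n_samples=2, state_dim=2).
predictions = np.stack([in_support, out_of_support], axis=1)
rewards = np.array([1.0, 1.0])

u = ensemble_uncertainty(predictions)
r_tilde = penalized_reward(rewards, u, lam=1.0)
```

Both samples start with the same raw reward. After the penalty, the well-supported sample keeps almost all of it, while the out-of-distribution sample is pushed down. This mirrors the aggressive versus conservative update behaviour described above.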