Offline reinforcement learning algorithms are still not trusted in practice, because the learned policy may perform worse than the original policy that generated the dataset, or may behave in unexpected ways that are unfamiliar to the user. At the same time, offline RL algorithms cannot tune their most important hyperparameter: the proximity of the learned policy to the original policy. We propose an algorithm that lets the user tune this hyperparameter at runtime, thereby addressing both issues simultaneously. Users can thus start from the original behavior, grant successively larger deviations, and stop at any point if the policy deteriorates or its behavior strays too far from the familiar one.
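To make the core idea concrete, below is a minimal sketch, not the paper's exact algorithm: a single policy network conditioned on a user-chosen proximity weight `lam` in [0, 1], trained with a loss that interpolates between behavior cloning (`lam` = 1, reproduce the original policy) and maximizing an estimated return (`lam` = 0). The class and function names and the `return_estimate` callable are illustrative assumptions, not part of the original text.

```python
# Hedged sketch of a proximity-conditioned offline RL policy (assumed design,
# not the authors' implementation). The proximity weight `lam` is an extra
# network input, so it can be changed by the user at runtime without retraining.
import torch
import torch.nn as nn


class ProximityConditionedPolicy(nn.Module):
    """Deterministic policy that receives the proximity weight as an input."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),
        )

    def forward(self, obs: torch.Tensor, lam: torch.Tensor) -> torch.Tensor:
        # lam has shape (batch, 1); concatenating it lets one network cover
        # the entire proximity trade-off.
        return self.net(torch.cat([obs, lam], dim=-1))


def mixed_loss(policy, obs, behavior_actions, return_estimate, lam):
    """Interpolate between imitating the dataset and maximizing return.

    `return_estimate(obs, actions)` is a placeholder for whatever return
    estimate (critic or learned-model rollout) the offline RL method provides.
    """
    actions = policy(obs, lam)
    bc = ((actions - behavior_actions) ** 2).mean(dim=-1)   # imitation term
    rl = -return_estimate(obs, actions).squeeze(-1)         # return term
    w = lam.squeeze(-1)
    return (w * bc + (1.0 - w) * rl).mean()


# During training, lam is sampled per batch so the policy learns all trade-offs;
# at deployment the user starts with lam = 1.0 (familiar behavior) and lowers it
# gradually, stopping whenever performance or behavior becomes unacceptable.
if __name__ == "__main__":
    obs_dim, act_dim, batch = 8, 2, 32
    policy = ProximityConditionedPolicy(obs_dim, act_dim)
    critic = lambda o, a: torch.zeros(o.shape[0], 1)  # dummy return estimate
    obs = torch.randn(batch, obs_dim)
    behavior_actions = torch.rand(batch, act_dim) * 2 - 1
    lam = torch.rand(batch, 1)
    loss = mixed_loss(policy, obs, behavior_actions, critic, lam)
    loss.backward()
```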