Batch reinforcement learning (RL) aims to leverage pre-collected data to find an optimal policy that maximizes the expected total reward in a dynamic environment. Existing methods require an absolute continuity assumption (e.g., that there are no non-overlapping regions) on the distribution induced by target policies with respect to the data distribution over the state, the action, or both. We propose a new batch RL algorithm that allows for singularity in both the state and action spaces (e.g., the existence of non-overlapping regions between the offline data distribution and the distribution induced by the target policies) in the setting of an infinite-horizon Markov decision process with continuous states and actions. We call our algorithm STEEL: SingulariTy-awarE rEinforcement Learning. Our algorithm is motivated by a new error analysis of off-policy evaluation, in which we use maximum mean discrepancy, together with distributionally robust optimization, to characterize the off-policy evaluation error caused by possible singularity and to enable model extrapolation. By leveraging the idea of pessimism, and under some technical conditions, we derive the first finite-sample regret guarantee for our proposed algorithm under singularity. Compared with existing algorithms, STEEL requires only a minimal data-coverage assumption, which improves the applicability and robustness of batch RL. In addition, we propose a two-step adaptive STEEL that is nearly tuning-free. Extensive simulation studies and one (semi-)real experiment on personalized pricing demonstrate the superior performance of our methods in dealing with possible singularity in batch RL.
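
To illustrate why maximum mean discrepancy (MMD) is a natural tool under singularity, the following minimal Python sketch estimates the squared MMD between two samples with a Gaussian kernel. It is not the STEEL algorithm; the function names, the bandwidth, and the one-dimensional "offline" and "target" samples are all hypothetical choices made for illustration. The point is that when the two supports are disjoint, a density ratio is undefined, yet the MMD remains finite and simply reports a larger discrepancy.

```python
import numpy as np

def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between the rows of x and the rows of y."""
    sq_dists = (
        np.sum(x**2, axis=1)[:, None]
        + np.sum(y**2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))

def mmd_squared(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between samples x and y."""
    k_xx = gaussian_kernel(x, x, bandwidth)
    k_yy = gaussian_kernel(y, y, bandwidth)
    k_xy = gaussian_kernel(x, y, bandwidth)
    return k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean()

rng = np.random.default_rng(0)

# Hypothetical offline data supported on [0, 1].
offline = rng.uniform(0.0, 1.0, size=(500, 1))
# Hypothetical target-policy samples: one with disjoint support, one overlapping.
target_disjoint = rng.uniform(2.0, 3.0, size=(500, 1))
target_overlap = rng.uniform(0.0, 1.0, size=(500, 1))

mmd_disjoint = mmd_squared(offline, target_disjoint)
mmd_overlap = mmd_squared(offline, target_overlap)

print(f"squared MMD, disjoint supports:    {mmd_disjoint:.4f}")
print(f"squared MMD, overlapping supports: {mmd_overlap:.4f}")
```

In this toy setting the disjoint case yields a clearly larger but still finite value, whereas an importance-weighting approach would have no valid density ratio to work with. How the paper combines this discrepancy with distributionally robust optimization and pessimism is not shown here.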