In batch reinforcement learning, poorly explored state-action pairs can result in inaccurate learned models and poorly performing associated policies. Various regularization methods can mitigate the problem of learning overly complex models in Markov decision processes (MDPs); however, they operate in technically and intuitively distinct ways and lack a common form in which to compare them. This paper unifies three regularization methods in a common framework -- a weighted average transition matrix. Considering regularization methods in this common form illuminates how the MDP structure and the state-action pair distribution of the batch data set influence the relative performance of regularization methods. We confirm the intuitions generated from the common framework by empirical evaluation across a range of MDPs and data collection policies.
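As a purely illustrative sketch (the abstract does not specify the exact form), a weighted average transition matrix can be read as a convex combination of the maximum-likelihood transition estimate from the batch and a simpler baseline model, with a weight that may depend on how often a state-action pair appears in the data:

$$\tilde{P}(s' \mid s, a) \;=\; \lambda_{s,a}\, \hat{P}(s' \mid s, a) \;+\; \bigl(1 - \lambda_{s,a}\bigr)\, Q(s' \mid s, a)$$

Under this reading, rarely visited pairs would receive a small weight $\lambda_{s,a}$ and be pulled toward the baseline $Q$, which is one way the batch data distribution could drive the relative performance of different regularization methods. The symbols $\hat{P}$, $Q$, and $\lambda_{s,a}$ are placeholders for this sketch, not notation taken from the paper.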