In most applications of model-based Markov decision processes, the parameters of the unknown underlying model are estimated from empirical data. Due to noise, the policy learned from the estimated model is often far from the optimal policy of the underlying model. When deployed in the environment of the underlying model, the learned policy therefore yields suboptimal performance, calling for solutions with better generalization. In this work we take a Bayesian perspective and regularize the objective function of the Markov decision process with prior information in order to obtain more robust policies. Two approaches are proposed, one based on $L^1$ regularization and the other on relative entropy regularization. We evaluate the proposed algorithms on synthetic simulations and on real-world search logs of a large-scale online shopping store. Our results demonstrate the robustness of regularized MDP policies against the noise present in the estimated models.
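As a rough sketch (not necessarily the exact formulation used in the paper), the two regularized objectives can be written as
\[
  \max_{\pi} \; \mathbb{E}_{\pi}\!\Big[\sum_{t \ge 0} \gamma^{t} r_t\Big] - \lambda\, \Omega(\pi),
  \qquad
  \Omega(\pi) = \|\pi - \pi_0\|_{1}
  \quad \text{or} \quad
  \Omega(\pi) = D_{\mathrm{KL}}\!\left(\pi \,\|\, \pi_0\right),
\]
where $\lambda \ge 0$ is a regularization weight and $\pi_0$ is a prior policy encoding the prior information; the symbols $\lambda$, $\pi_0$, and $\Omega$ are illustrative placeholders rather than notation taken from the paper.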