In these notes we tackle the problem of finding optimal policies for Markov decision processes (MDPs) that are not fully known to us. Our intention is to transition gradually from an offline setting to an online (learning) setting. In other words, we are moving towards reinforcement learning.