Despite its experimental success, Model-based Reinforcement Learning still lacks a complete theoretical understanding. To this end, we analyze the error in the cumulative reward using a contraction approach. We consider both stochastic and deterministic state transitions for continuous (non-discrete) state and action spaces. This approach does not require strong assumptions and recovers the typical quadratic dependence of the error on the horizon. We prove that branched rollouts can reduce this error and are essential, in the case of deterministic transitions, for obtaining a Bellman contraction. Our analysis of policy mismatch error also applies to Imitation Learning. In that setting, we show that GAN-type learning has an advantage over Behavioral Cloning when its discriminator is well-trained.
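For context, a hedged sketch of the kind of bound the abstract alludes to: in the discounted setting, with a one-step model error $\epsilon_m$ (e.g., in total variation) and rewards bounded by $R_{\max}$, standard simulation-lemma-style arguments yield a cumulative-reward gap that scales with the squared effective horizon $1/(1-\gamma)$. The form below is illustrative of that typical quadratic dependence, not the paper's exact statement, and constants vary across derivations.

% Illustrative simulation-lemma-style bound (not the paper's statement);
% \epsilon_m is the one-step model error, R_{\max} the reward bound.
\[
\bigl|\, J_{M}(\pi) - J_{\widehat{M}}(\pi) \,\bigr|
  \;\lesssim\; \frac{\gamma\, R_{\max}\, \epsilon_m}{(1-\gamma)^{2}},
\]
% i.e., the error in the cumulative reward is quadratic in the effective horizon 1/(1-\gamma).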