We introduce CriticSMC, a new algorithm for planning as inference, built from a novel composition of sequential Monte Carlo and learned soft-Q function heuristic factors. The algorithm is structured to allow large numbers of putative particles, leading to efficient use of computational resources and effective discovery of high-reward trajectories even in environments with difficult reward surfaces, such as those arising from hard constraints. Relative to prior art, our approach remains compatible with model-free reinforcement learning in the sense that the implicit policy we produce can be used at test time without a world model. Our experiments on self-driving car collision avoidance in simulation demonstrate improvements over baselines in infraction minimization relative to computational effort, while maintaining the diversity and realism of found trajectories.
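The following is a minimal, hypothetical sketch of the core idea stated above: for each particle, propose many putative actions from a prior, weight them by exponentiated soft-Q values acting as heuristic factors, and resample. The names soft_q, step, and critic_smc_step, the toy quadratic critic, the Gaussian action prior, and the per-particle resampling scheme are all illustrative assumptions, not the paper's actual implementation.

import numpy as np

rng = np.random.default_rng(0)

def soft_q(state, action):
    # Hypothetical stand-in for a learned soft-Q critic; here a toy
    # quadratic that penalizes actions far from a "safe" direction.
    return -np.sum((action - 0.1 * state) ** 2, axis=-1)

def step(state, action):
    # Hypothetical toy dynamics, used here only to roll the simulation
    # forward; the abstract notes the implicit policy itself needs no
    # world model at test time.
    return state + action

def critic_smc_step(states, prior_scale=0.5, n_putative=64):
    """One planning step: propose putative actions from a prior,
    weight them by exp(soft_q), and resample one per particle."""
    n_particles, act_dim = states.shape
    # Propose many putative actions for every particle.
    actions = rng.normal(scale=prior_scale,
                         size=(n_particles, n_putative, act_dim))
    # Soft-Q values serve as log-weights (heuristic factors).
    log_w = soft_q(states[:, None, :], actions)
    # Normalize per particle and resample (softmax weights).
    w = np.exp(log_w - log_w.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    chosen = np.array([rng.choice(n_putative, p=w[i])
                       for i in range(n_particles)])
    picked = actions[np.arange(n_particles), chosen]
    return step(states, picked)

# Usage: 8 particles in a 2-D state/action space, rolled out 5 steps.
states = rng.normal(size=(8, 2))
for _ in range(5):
    states = critic_smc_step(states)
print(states.round(2))

Because many putative actions are scored per particle before a single resampling decision, compute scales with cheap critic evaluations rather than expensive environment rollouts, which is the efficiency the abstract claims.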