An important step in the design of autonomous systems is to evaluate the probability that a failure will occur. In safety-critical domains, the failure probability is extremely small, so evaluating a policy through Monte Carlo sampling is inefficient. Adaptive importance sampling approaches have been developed for rare event estimation but do not scale well to sequential systems with long horizons. In this work, we develop two adaptive importance sampling algorithms that can efficiently estimate the probability of rare events for sequential decision making systems. The basis for these algorithms is the minimization of the Kullback-Leibler divergence between a state-dependent proposal distribution and a target distribution over trajectories, but the resulting algorithms resemble policy gradient and value-based reinforcement learning. We apply multiple importance sampling to reduce the variance of our estimate and to address the issue of multi-modality in the optimal proposal distribution. We demonstrate our approach on a control task with both continuous and discrete action spaces and show accuracy improvements over several baselines.
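For concreteness, the trajectory-level objective referenced above can be sketched in its standard adaptive importance sampling form; the notation here (trajectory \(\tau\), nominal trajectory density \(p\), proposal \(q\), failure set \(F\), sample size \(N\)) is illustrative and not taken from the paper itself. The failure probability estimate is
\[
\hat{\mu} \;=\; \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\{\tau_i \in F\}\,\frac{p(\tau_i)}{q(\tau_i)}, \qquad \tau_i \sim q,
\]
and the proposal is fit by minimizing a Kullback-Leibler divergence to the optimal (zero-variance) proposal \(q^*(\tau) \propto \mathbf{1}\{\tau \in F\}\,p(\tau)\); one common choice, as in the cross-entropy method, is
\[
q \;=\; \arg\min_{q' \in \mathcal{Q}} \; D_{\mathrm{KL}}\!\left(q^* \,\middle\|\, q'\right),
\]
though the direction of the divergence and the parameterization of \(\mathcal{Q}\) used in the paper may differ from this sketch.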