In off-policy reinforcement learning, a behaviour policy performs exploratory interactions with the environment to obtain state-action-reward samples, which are then used to learn a target policy that optimises the expected return. This gives rise to the problem of off-policy evaluation, where one must evaluate the target policy from samples collected by the often unrelated behaviour policy. Importance sampling is a traditional statistical technique frequently applied to off-policy evaluation. While importance sampling estimators are unbiased, their variance increases exponentially with the horizon of the decision process because the importance weight is computed as a product of action probability ratios, yielding low-accuracy estimates in domains that involve long-term planning. This paper proposes state-based importance sampling (SIS), which drops from the computation of the importance weight the action probability ratios of sub-trajectories with "negligible states" -- roughly speaking, states in which the chosen actions have no impact on the return estimate. Theoretical results show that this reduces the exponent in the variance upper bound and also improves the mean squared error (MSE). An automated search algorithm based on covariance testing is proposed to identify a negligible state set that minimises the MSE of state-based importance sampling. Experiments are conducted on a lift domain, which includes "lift states" where the action has no impact on the subsequent state and reward. The results demonstrate that, using the search algorithm, SIS yields reduced variance and improved accuracy compared to traditional importance sampling, per-decision importance sampling, and incremental importance sampling.
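For concreteness, the sketch below contrasts the two weight computations the abstract describes: the ordinary importance-sampling weight multiplies the action probability ratio at every timestep, whereas the SIS weight skips timesteps whose state lies in the negligible set, shortening the product and thus reducing variance. The trajectory format, the callables `pi_e`/`pi_b`, and the name `negligible_states` are illustrative assumptions for this sketch, not the paper's API.

```python
def is_weight(trajectory, pi_e, pi_b):
    """Ordinary IS weight: w = prod_t pi_e(a_t|s_t) / pi_b(a_t|s_t)
    over all timesteps of the trajectory."""
    w = 1.0
    for s, a, _ in trajectory:
        w *= pi_e(a, s) / pi_b(a, s)
    return w

def sis_weight(trajectory, pi_e, pi_b, negligible_states):
    """SIS weight (sketch): ratios at timesteps whose state is in the
    negligible set are dropped from the product, so the weight is a
    product over fewer factors than the horizon."""
    w = 1.0
    for s, a, _ in trajectory:
        if s not in negligible_states:
            w *= pi_e(a, s) / pi_b(a, s)
    return w

def weighted_return(trajectory, weight, gamma=1.0):
    """Off-policy return estimate for a single trajectory: the
    (IS or SIS) weight times the discounted sum of rewards."""
    g = sum(gamma**t * r for t, (_, _, r) in enumerate(trajectory))
    return weight * g

# Toy usage: state 1 plays the role of a "lift state" whose action is
# assumed irrelevant to the return, so its ratio is dropped under SIS.
pi_e = lambda a, s: 0.9 if a == 0 else 0.1   # target policy (hypothetical)
pi_b = lambda a, s: 0.5                      # uniform behaviour policy
traj = [(0, 0, 1.0), (1, 1, 0.0), (0, 0, 1.0)]
print(is_weight(traj, pi_e, pi_b))                          # 1.8 * 0.2 * 1.8
print(sis_weight(traj, pi_e, pi_b, negligible_states={1}))  # 1.8 * 1.8
```

Because the extreme ratio at the negligible state (0.2 above) is removed, the SIS weight varies over a narrower range across trajectories, which is the mechanism behind the reduced variance exponent claimed in the abstract.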