Behavioural characterizations (BCs) of decision-making agents, or of their policies, are used to study the outcomes of training algorithms and, within the algorithms themselves, to encourage unique policies, match expert policies, or restrict how much a policy changes per update. However, previously presented solutions are not generally applicable, owing to a lack of expressive power, computational constraints, or constraints on the policy or environment. Furthermore, many BCs rely on the actions of policies. We discuss and demonstrate how such BCs can be misleading, especially in stochastic environments, and propose a novel solution based on which states policies visit. We run experiments to evaluate the quality of the proposed BC against baselines and evaluate their use in studying training algorithms, novelty search, and trust-region policy optimization. The code is available at https://github.com/miffyli/policy-supervectors.
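To make the idea of a state-based BC concrete, the following is a minimal sketch that characterizes a policy by the empirical distribution of states it visits and compares two policies by the distance between those distributions. It is not the method proposed in the paper: the toy chain environment, the histogram representation, the total-variation distance, and all names (`rollout`, `visitation_bc`, `go_right`, `go_left`) are assumptions made purely for illustration.

```python
import numpy as np

N_STATES = 7  # hypothetical toy chain with states 0..6; episodes start in the middle


def rollout(policy, rng, steps=50, slip=0.1):
    """Run one episode and return the list of visited states."""
    state = N_STATES // 2
    visited = [state]
    for _ in range(steps):
        action = policy(state)  # -1 (left) or +1 (right)
        if rng.random() < slip:  # stochastic environment: the action is flipped
            action = -action
        state = int(np.clip(state + action, 0, N_STATES - 1))
        visited.append(state)
    return visited


def visitation_bc(policy, rng, episodes=200):
    """Illustrative BC: normalized histogram of visited states."""
    counts = np.zeros(N_STATES)
    for _ in range(episodes):
        counts += np.bincount(rollout(policy, rng), minlength=N_STATES)
    return counts / counts.sum()


def total_variation(p, q):
    """Total-variation distance between two discrete distributions."""
    return 0.5 * float(np.abs(p - q).sum())


def go_right(state):
    return 1


def go_left(state):
    return -1


rng = np.random.default_rng(0)
bc_right = visitation_bc(go_right, rng)
bc_right_again = visitation_bc(go_right, rng)  # same policy, fresh rollouts
bc_left = visitation_bc(go_left, rng)

d_same = total_variation(bc_right, bc_right_again)
d_diff = total_variation(bc_right, bc_left)
print(f"same policy: {d_same:.3f}, different policies: {d_diff:.3f}")
```

In this sketch the distance between two sets of rollouts from the same policy stays small despite the environment noise, while policies that end up in different parts of the state space are far apart. A histogram only works for small discrete state spaces; continuous or high-dimensional states would need a different density model.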