Ranking Policy Decisions

Policies trained via Reinforcement Learning (RL) are often needlessly complex, making them difficult to analyse and interpret. In a run with $n$ time steps, a policy makes $n$ decisions on which action to take; we conjecture that only a small subset of these decisions delivers value over selecting a simple default action. Given a trained policy, we propose a novel black-box method based on statistical fault localisation that ranks the states of the environment according to the importance of the decisions made in those states. We argue that, among other things, the ranked list of states can help explain and understand the policy. Because the ranking method is statistical, a direct evaluation of its quality is hard. As a proxy for quality, we use the ranking to create new, simpler policies from the original ones by pruning decisions identified as unimportant (that is, replacing them with default actions) and measuring the impact on performance. Our experiments on a diverse set of standard benchmarks demonstrate that pruned policies can perform at a level comparable to the original policies. Conversely, we show that naive approaches to ranking policy decisions, e.g., ranking based on the frequency of visiting a state, do not result in high-performing pruned policies.
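To make the idea of ranking and pruning concrete, the following is a minimal sketch on a toy problem, not the authors' implementation. Everything in it is an assumption made for illustration: the six-state corridor, the names `rank_states` and `prune`, the random replacement of decisions by the default action, and the choice of the Ochiai measure (a standard score in statistical fault localisation; the abstract does not say which measure is used). The ranking sees only which states were replaced and whether each run passed, so it treats the policy as a black box.

```python
import math
import random

N_STATES = 6          # hypothetical corridor: states 0..5, visited in order
PITS = {2, 4}         # states in which walking (the default action) fails
WALK, JUMP = 0, 1     # WALK plays the role of the simple default action


def original_policy(state):
    """Stand-in for a trained policy: it jumps everywhere, needed or not."""
    return JUMP


def run_episode(policy):
    """Return True (pass) if the agent crosses the corridor, False otherwise."""
    for state in range(N_STATES):
        if state in PITS and policy(state) != JUMP:
            return False
    return True


def prune(policy, keep_states, default_action=WALK):
    """Follow `policy` only in `keep_states`; take the default action elsewhere."""
    def pruned(state):
        return policy(state) if state in keep_states else default_action
    return pruned


def rank_states(policy, n_mutants=500, mutate_prob=0.5, seed=0):
    """Rank states by how strongly replacing their decision correlates with failure."""
    rng = random.Random(seed)
    states = range(N_STATES)
    ef = dict.fromkeys(states, 0)   # runs that failed with state s replaced
    ep = dict.fromkeys(states, 0)   # runs that passed with state s replaced
    total_fail = 0
    for _ in range(n_mutants):
        # Replace the decision by the default action in a random subset of states.
        mutated = {s for s in states if rng.random() < mutate_prob}
        passed = run_episode(prune(policy, set(states) - mutated))
        total_fail += not passed
        for s in mutated:
            if passed:
                ep[s] += 1
            else:
                ef[s] += 1

    def ochiai(s):
        denom = math.sqrt(total_fail * (ef[s] + ep[s]))
        return ef[s] / denom if denom else 0.0

    return sorted(states, key=ochiai, reverse=True)


if __name__ == "__main__":
    ranking = rank_states(original_policy)
    print("ranking:", ranking)
    top2 = set(ranking[:2])
    print("pruned policy passes:", run_episode(prune(original_policy, top2)))
```

In this toy setting only the decisions in the two pit states matter, so a pruned policy that keeps the original action in the two top-ranked states and walks everywhere else should still cross the corridor. A frequency-based ranking could not tell the states apart here, since every state is visited equally often.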
