Interpretability of reinforcement learning policies is essential for many real-world tasks, but learning such interpretable policies is a hard problem. In particular, rule-based policies such as decision trees and rule lists are difficult to optimize due to their non-differentiability. While existing techniques can learn verifiable decision tree policies, there is no guarantee that the learners generate a decision tree that performs optimally. In this work, we study the optimization of size-limited decision trees for Markov Decision Processes (MDPs) and propose OMDTs: Optimal MDP Decision Trees. Given a user-defined size limit and an MDP formulation, OMDT directly maximizes the expected discounted return of the decision tree using Mixed-Integer Linear Programming. By training optimal decision tree policies for different MDPs, we empirically study the optimality gap of existing imitation learning techniques and find that they perform sub-optimally. We show that this is due to an inherent shortcoming of imitation learning, namely that complex policies cannot be represented using size-limited trees. In such cases, it is better to directly optimize the tree for expected return. While there is generally a trade-off between the performance and interpretability of machine learning models, we find that OMDTs limited to a depth of 3 often perform close to the optimal limit.
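The MILP approach builds on the classical linear-programming view of MDP optimization, in which expected discounted return is maximized over state-action occupancy variables subject to Bellman flow constraints. The snippet below is a minimal sketch of that underlying LP only (without the decision-tree constraints), using a hypothetical toy MDP and the PuLP solver; it is not the paper's exact formulation. OMDT additionally introduces integer variables that force the induced policy to be representable by a size-limited decision tree, which turns this LP into a MILP.

```python
# Sketch: occupancy-measure LP for a discounted MDP (toy numbers, illustration only).
from pulp import LpProblem, LpMaximize, LpVariable, lpSum, value

states, actions = [0, 1], [0, 1]
gamma = 0.9
mu0 = {0: 1.0, 1: 0.0}                       # initial state distribution
P = {  # P[(s, a)][s'] = transition probability
    (0, 0): {0: 0.8, 1: 0.2}, (0, 1): {0: 0.1, 1: 0.9},
    (1, 0): {0: 0.5, 1: 0.5}, (1, 1): {0: 0.0, 1: 1.0},
}
R = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.0, (1, 1): 2.0}  # rewards

prob = LpProblem("discounted_mdp", LpMaximize)
# x[s, a]: expected discounted occupancy of state-action pair (s, a)
x = {(s, a): LpVariable(f"x_{s}_{a}", lowBound=0) for s in states for a in actions}

# Objective: expected discounted return
prob += lpSum(R[s, a] * x[s, a] for s in states for a in actions)

# Bellman flow constraints: occupancy of s' equals initial mass plus discounted inflow
for sp in states:
    prob += (lpSum(x[sp, a] for a in actions)
             == mu0[sp] + gamma * lpSum(P[s, a][sp] * x[s, a]
                                        for s in states for a in actions))

prob.solve()
policy = {s: max(actions, key=lambda a: value(x[s, a])) for s in states}
print("expected return:", value(prob.objective), "policy:", policy)
```

In this unconstrained form the optimal policy can act arbitrarily in every state; restricting it to a depth-limited tree over the state features is what requires the additional binary variables and makes the optimization a MILP.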