Understanding a decision-maker's priorities by observing their behavior is critical for transparency and accountability in decision processes, such as in healthcare. While conventional approaches to policy learning almost invariably assume stationary behavior, this is hardly true in practice: medical practice is constantly evolving as clinical professionals fine-tune their knowledge over time. For instance, as the medical community's understanding of organ transplantation has progressed over the years, a pertinent question is: how have actual organ allocation policies been evolving? To answer this, we desire a policy learning method that provides interpretable representations of decision-making, in particular capturing an agent's non-stationary knowledge of the world, and that operates in an offline manner. First, we model the evolving behavior of decision-makers in terms of contextual bandits, and formalize the problem of Inverse Contextual Bandits ("ICB"). Second, we propose two concrete algorithms as solutions, learning parametric and nonparametric representations of an agent's behavior, respectively. Finally, using both real and simulated data for liver transplantation, we illustrate the applicability and explainability of our method, and benchmark and validate the accuracy of our algorithms.
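To make the setting concrete, the following is a minimal sketch of the kind of non-stationary data-generating process described above: an agent repeatedly chooses among arms (e.g. allocation options) according to a linear preference vector that drifts over time, producing an offline log of (contexts, action) pairs. All names, the linear scoring rule, and the drift model here are illustrative assumptions for exposition, not the paper's actual formalism or algorithms.

```python
import random

def drifting_agent_demo(n_rounds=200, n_arms=3, dim=2, seed=0):
    """Toy data-generating process for the ICB setting (illustrative only):
    the agent acts greedily w.r.t. a preference vector theta_t that slowly
    drifts; an ICB method would try to recover that trajectory offline."""
    rng = random.Random(seed)
    theta = [1.0, 0.0]  # initial preference weights (hypothetical)
    log = []            # offline dataset of (per-arm contexts, chosen arm)
    for t in range(n_rounds):
        # per-arm contexts, e.g. patient/organ features in transplantation
        contexts = [[rng.random() for _ in range(dim)] for _ in range(n_arms)]
        # the agent scores each arm with its *current* beliefs and acts greedily
        scores = [sum(w * x for w, x in zip(theta, c)) for c in contexts]
        action = max(range(n_arms), key=scores.__getitem__)
        log.append((contexts, action))
        # beliefs evolve slowly: weight shifts from feature 0 to feature 1
        theta = [theta[0] - 1.0 / n_rounds, theta[1] + 1.0 / n_rounds]
    return log

log = drifting_agent_demo()
```

Given only `log`, the inverse problem is to infer how the agent's preferences changed over the rounds; the stationarity assumption of standard policy learning would instead fit a single fixed `theta`.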