Understanding a decision-maker's priorities by observing their behavior is critical for transparency and accountability in decision processes, such as in healthcare. Though conventional approaches to policy learning almost invariably assume stationarity in behavior, this is hardly true in practice: Medical practice is constantly evolving as clinical professionals fine-tune their knowledge over time. For instance, as the medical community's understanding of organ transplantations has progressed over the years, a pertinent question is: How have actual organ allocation policies been evolving? To give an answer, we desire a policy learning method that provides interpretable representations of decision-making, in particular one that captures an agent's non-stationary knowledge of the world and operates in an offline manner. First, we model the evolving behavior of decision-makers in terms of contextual bandits, and formalize the problem of Inverse Contextual Bandits (ICB). Second, we propose two concrete algorithms as solutions, learning parametric and nonparametric representations of an agent's behavior. Finally, using both real and simulated data for liver transplantations, we illustrate the applicability and explainability of our method, and we benchmark and validate its accuracy.
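As a rough sketch of the setup described above (the notation here is our own shorthand, not taken verbatim from the paper): at each step $t$ the agent observes a context $x_t \in \mathcal{X}$ and selects an action $a_t \in \mathcal{A}$ according to a behavior policy $\pi_{\theta_t}(a \mid x)$ driven by their current, evolving estimate $\theta_t$ of expected rewards. Given only an offline demonstration dataset
\[
\mathcal{D} = \{(x_t, a_t)\}_{t=1}^{T},
\]
with no observed rewards, ICB asks to recover the trajectory of estimates $\theta_1, \ldots, \theta_T$ (equivalently, the sequence of policies $\pi_{\theta_1}, \ldots, \pi_{\theta_T}$) that best explains the observed actions, thereby exposing how the agent's knowledge, and hence their allocation policy, has changed over time.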