Understanding an agent's priorities by observing its behavior is critical for transparency and accountability in decision processes, such as in healthcare. While conventional approaches to policy learning almost invariably assume stationarity in behavior, this is hardly true in practice: medical practice is constantly evolving, and clinical professionals are continually fine-tuning their priorities. We desire an approach to policy learning that (1) provides interpretable representations of decision-making, (2) accounts for non-stationarity in behavior, and (3) operates in an offline manner. First, we model the behavior of learning agents in terms of contextual bandits and formalize the problem of inverse contextual bandits (ICB). Second, we propose two algorithms to tackle ICB, each making varying degrees of assumptions regarding the agent's learning strategy. Finally, through both real and simulated data for liver transplantation, we illustrate the applicability and explainability of our method, as well as validate its accuracy.
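To make the setting concrete, the following is a minimal sketch, not the paper's algorithms: it simulates a non-stationary contextual-bandit agent with linear reward weights that drift over time (a softmax choice rule, the drift rate, and the problem dimensions are all illustrative assumptions), producing the kind of offline decision log that ICB takes as input.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, T = 3, 4, 200  # context dimension, number of actions, horizon (assumed values)

# The agent's priorities: reward weights that evolve over time
# (non-stationary behavior), here modeled as a slow random walk.
theta = rng.normal(size=d)

trajectory = []  # offline log of (contexts, action) pairs -- the input to ICB
for t in range(T):
    contexts = rng.normal(size=(k, d))   # one context vector per candidate action
    utilities = contexts @ theta         # agent's current valuation of each action
    probs = np.exp(utilities - utilities.max())
    probs /= probs.sum()                 # numerically stable softmax
    action = rng.choice(k, p=probs)      # stochastic (Boltzmann) choice
    trajectory.append((contexts, action))
    theta += 0.05 * rng.normal(size=d)   # priorities drift between decisions

# The ICB goal: given only `trajectory`, recover the evolving weights theta_t --
# an interpretable account of how the agent's priorities changed over time.
print(f"Logged {len(trajectory)} decisions; final weights {np.round(theta, 2)}")
```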