K-SHAP: Policy Clustering Algorithm for Anonymous State-Action Pairs

Learning agent behaviors from observational data has been shown to improve our understanding of their decision-making processes, advancing our ability to explain their interactions with the environment and other agents. While multiple learning techniques have been proposed in the literature, one particular setting has not yet been explored: multi-agent systems where agent identities remain anonymous. For instance, in financial markets, labeled data that identifies market participant strategies is typically proprietary, and only the anonymous state-action pairs that result from the interaction of multiple market participants are publicly available. As a result, sequences of agent actions are not observable, restricting the applicability of existing work. In this paper, we propose a Policy Clustering algorithm, called K-SHAP, that learns to group anonymous state-action pairs according to the agent policies. We frame the problem as an Imitation Learning (IL) task, and we learn a world-policy able to mimic all agent behaviors across different environmental states. We leverage the world-policy to explain each anonymous observation through an additive feature attribution method called SHAP (SHapley Additive exPlanations). Finally, by clustering the explanations, we show that we are able to identify different agent policies and group observations accordingly. We evaluate our approach on simulated synthetic market data and a real-world financial dataset. We show that our proposal significantly and consistently outperforms the existing methods, identifying different agent strategies.
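The three steps described in the abstract (learn a world-policy, explain each anonymous observation with Shapley-style attributions, cluster the explanations) can be illustrated with a minimal, standard-library-only sketch. Everything below is an assumption made for illustration and not the authors' implementation: the two-agent toy data, the tabular world-policy, the exact Shapley enumeration (feasible only for very few features), and the plain k-means step are all hypothetical stand-ins, and names such as `make_data`, `fit_world_policy`, `shapley`, and `kmeans` do not come from the paper.

```python
import itertools
import math
import random

random.seed(0)

# Hypothetical toy data: two hidden agents feed one anonymous action stream.
# Agent i takes action 1 when state feature i is positive, else action 0.
def make_data(n=200):
    states, actions, hidden = [], [], []
    for _ in range(n):
        s = (random.uniform(-1, 1), random.uniform(-1, 1))
        agent = random.randrange(2)
        states.append(s)
        actions.append(1 if s[agent] > 0 else 0)
        hidden.append(agent)  # kept for inspection only, never used for learning
    return states, actions, hidden

# Stand-in world-policy: empirical P(action | sign pattern of the state).
def fit_world_policy(states, actions):
    counts = {}
    for s, a in zip(states, actions):
        counts.setdefault(tuple(x > 0 for x in s), [0, 0])[a] += 1

    def policy(s, a):
        c = counts.get(tuple(x > 0 for x in s), [1, 1])  # unseen pattern: 0.5
        return c[a] / (c[0] + c[1])
    return policy

# Exact Shapley values of policy(., a) at state s, relative to background states.
def shapley(policy, s, a, background):
    d = len(s)

    def value(subset):
        # Features in `subset` come from s; the others come from each background state.
        total = 0.0
        for b in background:
            mixed = tuple(s[i] if i in subset else b[i] for i in range(d))
            total += policy(mixed, a)
        return total / len(background)

    phi = [0.0] * d
    for i in range(d):
        others = [j for j in range(d) if j != i]
        for r in range(d):
            weight = math.factorial(r) * math.factorial(d - r - 1) / math.factorial(d)
            for subset in itertools.combinations(others, r):
                phi[i] += weight * (value(set(subset) | {i}) - value(set(subset)))
    return phi

# Plain k-means on the explanation vectors (Lloyd iterations, random initial centers).
def kmeans(points, k, iters=20):
    centers = [list(p) for p in random.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [
            min(range(k), key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centers[c])))
            for pt in points
        ]
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:  # leave the center unchanged if the cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

states, actions, hidden = make_data()
policy = fit_world_policy(states, actions)
background = states[:30]
explanations = [shapley(policy, s, a, background) for s, a in zip(states, actions)]
labels = kmeans(explanations, k=2)
```

In this toy setting the attribution vector of an observation indicates which state feature supports the observed action, so observations from the two hidden agents tend to separate whenever their rules disagree. When both rules prescribe the same action, the explanations coincide and the observations cannot be told apart, so the sketch should not be expected to recover the hidden labels perfectly.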
