This paper studies an instance of the multi-armed bandit (MAB) problem in which several causal MABs operate chronologically in the same dynamical system. In practice, the reward distribution of each bandit is governed by the same non-trivial dependence structure, namely a dynamic causal model. The model is dynamic because we allow each causal MAB to depend on the preceding MAB, which lets us transfer information between agents. Our contribution, the Chronological Causal Bandit (CCB), is useful in discrete decision-making settings where causal effects change across time and can be informed by earlier interventions in the same system. In this paper, we present early findings for the CCB on a toy problem.
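The abstract does not spell out the CCB algorithm, so the following Python sketch is only an intuition aid built on our own assumptions, not the paper's method. It uses a hypothetical one-dimensional dynamical system, Z_t = 0.5 * Z_{t-1} + X_t, with a time-varying target, so the best intervention changes across time. Each time step runs its own bandit, with standard UCB1 swapped in for illustration, over hard interventions do(X_t = x). The intervention each bandit selects is then implemented in the system, so the next bandit faces reward distributions shaped by that earlier choice. All names (`next_state`, `target`, `pull`, `ucb1`, `chronological_bandits`) and all numbers are hypothetical.

```python
import math
import random

# Hypothetical arm set: each arm is a hard intervention do(X_t = x).
ARMS = [-1.0, 0.0, 1.0]


def next_state(z_prev: float, x: float) -> float:
    """Toy transition (assumption): Z_t = 0.5 * Z_{t-1} + X_t."""
    return 0.5 * z_prev + x


def target(t: int) -> float:
    """Toy time-varying target, so the optimal intervention changes over time."""
    return 1.0 if t % 2 == 0 else -0.5


def pull(t: int, z_prev: float, x: float, rng: random.Random) -> float:
    """Noisy reward Y_t = -(Z_t - target_t)^2 + eps under do(X_t = x)."""
    z = next_state(z_prev, x)
    return -(z - target(t)) ** 2 + rng.gauss(0.0, 0.1)


def ucb1(t: int, z_prev: float, rounds: int, rng: random.Random) -> int:
    """UCB1 for the bandit at time step t; returns the arm with the best empirical mean.

    Rewards are not rescaled to [0, 1], so UCB1's formal regret guarantee does
    not apply here; it is used purely for illustration.
    """
    counts = [0] * len(ARMS)
    sums = [0.0] * len(ARMS)
    for n in range(1, rounds + 1):
        if n <= len(ARMS):
            i = n - 1  # initialisation: pull each arm once
        else:
            i = max(
                range(len(ARMS)),
                key=lambda k: sums[k] / counts[k]
                + math.sqrt(2 * math.log(n) / counts[k]),
            )
        counts[i] += 1
        sums[i] += pull(t, z_prev, ARMS[i], rng)
    return max(range(len(ARMS)), key=lambda k: sums[k] / counts[k])


def chronological_bandits(T: int = 4, rounds: int = 300, seed: int = 0) -> list:
    """Run T bandits in sequence; each one sees the state left by its predecessor."""
    rng = random.Random(seed)
    z = 0.0  # initial state Z_0 (assumption)
    chosen = []
    for t in range(T):
        x = ARMS[ucb1(t, z, rounds, rng)]
        chosen.append(x)
        # Implement the selected intervention: this is how information flows
        # from the bandit at time t to the bandit at time t + 1 in this toy.
        z = next_state(z, x)
    return chosen


if __name__ == "__main__":
    print(chronological_bandits())
```

In this toy the optimal intervention alternates between 1.0 and -1.0, and by how much each arm is preferred depends on the state inherited from the previous step. Here the transfer happens only through the implemented intervention. A method built on a dynamic causal model, as the abstract describes, could also reuse what earlier bandits learned about the causal effects themselves.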