We consider a stochastic multi-armed bandit (MAB) problem with delayed impact of actions. In our setting, actions taken in the past impact the arm rewards in the subsequent future. This delayed impact of actions is prevalent in the real world. For example, the ability of people in a certain social group to repay a loan might depend on how frequently that group's loan applications have historically been approved. If banks keep rejecting loan applications from people in a disadvantaged group, it could create a feedback loop and further damage that group's chances of obtaining loans. In this paper, we formulate this delayed and long-term impact of actions within the context of multi-armed bandits. We generalize the bandit setting to encode the dependency on this "bias" induced by the action history during learning. The goal is to maximize the collected utilities over time while taking into account the dynamics created by the delayed impacts of historical actions. We propose an algorithm that achieves a regret of $\tilde{\mathcal{O}}(KT^{2/3})$ and show a matching regret lower bound of $\Omega(KT^{2/3})$, where $K$ is the number of arms and $T$ is the learning horizon. Our results complement the bandit literature by adding techniques to deal with actions with long-term impacts and have implications for designing fair algorithms.
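To make the setting concrete, the sketch below simulates a toy bandit whose arm means drift with the historical pull frequency of each arm, mimicking the feedback loop described above. It is not the paper's model or algorithm: the impact function, the parameter values, and the explore-then-commit learner (a standard schedule whose $T^{2/3}$-length exploration phase is characteristic of $\tilde{\mathcal{O}}(KT^{2/3})$-type bounds) are all illustrative assumptions.

```python
import numpy as np

# Illustrative sketch only: the reward model and all parameters below are
# assumptions, not the paper's formulation.

rng = np.random.default_rng(0)

K = 3                                      # number of arms
T = 30_000                                 # learning horizon
base_means = np.array([0.5, 0.4, 0.6])     # hypothetical baseline mean rewards
impact = np.array([0.3, 0.5, 0.1])         # hypothetical sensitivity to history


def mean_reward(arm, pulls, t):
    """Mean reward of `arm` after it has been pulled `pulls` times in t rounds.

    The more often an arm has been chosen historically, the higher its mean
    reward -- a stylized version of the delayed-impact feedback loop.
    """
    freq = pulls / max(t, 1)               # historical pull frequency
    return float(np.clip(base_means[arm] + impact[arm] * freq, 0.0, 1.0))


# Explore-then-commit with ~T^{2/3} exploration rounds per arm.
explore_rounds = int(T ** (2 / 3))
counts = np.zeros(K, dtype=int)
reward_sums = np.zeros(K)

t = 0
for arm in range(K):
    for _ in range(explore_rounds):
        counts[arm] += 1
        t += 1
        reward_sums[arm] += rng.binomial(1, mean_reward(arm, counts[arm], t))

# Commit to the empirically best arm for the remaining rounds.
best = int(np.argmax(reward_sums / counts))
total = reward_sums.sum()
while t < T:
    counts[best] += 1
    t += 1
    total += rng.binomial(1, mean_reward(best, counts[best], t))

print(f"committed to arm {best}, average reward {total / T:.3f}")
```

Because the arm means shift with the empirical pull frequencies, estimates gathered during exploration are biased relative to the rewards realized after committing; handling that bias is exactly what distinguishes this setting from a standard stochastic bandit.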