We consider a variant of the contextual bandit problem with corrupted context, which we call the contextual bandit problem with corrupted context and action correlation, where actions exhibit a relational structure that can be exploited to guide the exploration of viable next decisions. Our setting is primarily motivated by adaptive mobile health interventions and related applications, where users may transition through different stages requiring more targeted action selection approaches. In such settings, maintaining user engagement is paramount to the success of interventions, so it is vital to provide relevant recommendations in a timely manner. The context provided by users may not be informative at every decision point, and standard contextual approaches to action selection will then incur high regret. We propose a meta-algorithm that uses a referee to dynamically combine the policies of a contextual bandit and a multi-armed bandit, similar to previous work, as well as a simple correlation mechanism that captures action-to-action transition probabilities, allowing for more efficient exploration of time-correlated actions. We empirically evaluate the performance of this algorithm on a simulation in which the sequence of best actions is determined by a hidden state that evolves in a Markovian manner. We show that the proposed meta-algorithm reduces regret in situations where the performance of the two base policies varies over time, such that one is strictly superior to the other for a given period. To demonstrate that our setting has practical applicability, we evaluate our method on several real-world data sets, showing clearly better empirical performance compared to a set of simple baseline algorithms.
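The two components described above can be illustrated with a minimal sketch. The class names, the UCB1 base policy, the discounted-score referee, and the Laplace-smoothed transition counts are illustrative assumptions, not the paper's exact method:

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 4

class UCB1:
    """Context-free multi-armed bandit base policy (standard UCB1)."""
    def __init__(self, k):
        self.counts = np.zeros(k)
        self.values = np.zeros(k)
    def select(self, t):
        # Play each arm once, then pick the arm with the highest upper bound.
        if (self.counts == 0).any():
            return int(np.argmin(self.counts))
        ucb = self.values + np.sqrt(2.0 * np.log(t + 1) / self.counts)
        return int(np.argmax(ucb))
    def update(self, a, r):
        self.counts[a] += 1
        self.values[a] += (r - self.values[a]) / self.counts[a]

class Referee:
    """Tracks an exponentially discounted reward estimate per base policy
    (e.g., contextual bandit vs. multi-armed bandit) and routes each round
    to the policy that has performed better recently."""
    def __init__(self, n_policies, gamma=0.95):
        self.scores = np.zeros(n_policies)
        self.gamma = gamma
    def choose(self):
        return int(np.argmax(self.scores))
    def update(self, policy_idx, reward):
        self.scores *= self.gamma          # discount old evidence
        self.scores[policy_idx] += reward  # credit the policy that acted

# Action-to-action transition counts: the correlation mechanism.
# Laplace smoothing avoids zero-probability transitions early on.
trans_counts = np.ones((n_actions, n_actions))

def correlated_explore(prev_action):
    """Bias exploration toward actions that have historically
    followed prev_action, instead of exploring uniformly."""
    p = trans_counts[prev_action] / trans_counts[prev_action].sum()
    return int(rng.choice(n_actions, p=p))
```

On each round the referee selects a base policy, the chosen policy picks an action (with exploration steps drawn via `correlated_explore`), and the observed reward updates both the acting policy and the referee's score for it, so control shifts to whichever policy is currently superior.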