Contextual bandits, which leverage the baseline features of sequentially arriving individuals to optimize cumulative rewards while balancing exploration and exploitation, are critical for online decision-making. Existing approaches typically assume no interference, meaning that each individual's action affects only their own reward. This assumption can be violated in many practical scenarios. Ignoring interference can yield short-sighted policies that maximize only the immediate outcome for each individual, leading to suboptimal decisions and potentially higher regret over time. To address this gap, we introduce the foresighted online policy with interference (FRONT), which accounts for the long-term impact of the current decision on subsequent decisions and rewards. FRONT employs a sequence of exploratory and exploitative strategies to handle interference, ensuring robust parameter inference and regret minimization. Theoretically, we establish a tail bound for the online estimator and derive the asymptotic distribution of the parameters of interest under suitable conditions on the interference network. We further show that FRONT attains sublinear regret under two distinct definitions of regret, capturing both the immediate and consequential impacts of decisions, and we establish these results both with and without statistical inference. The effectiveness of FRONT is demonstrated through extensive simulations and a real-world application to urban hotel profits.