We study best-arm identification with a fixed budget and contextual (covariate) information in stochastic multi-armed bandit problems. In each round, after observing contextual information, we choose a treatment arm using past observations and the current context. Our goal is to identify the best treatment arm, defined as the arm with the maximal expected reward marginalized over the contextual distribution, with a minimal probability of misidentification. First, we derive semiparametric lower bounds for this problem, where we regard the gaps between the expected rewards of the best and suboptimal treatment arms as the parameters of interest, and all other parameters, such as the expected rewards conditioned on contexts, as nuisance parameters. We then develop the "Contextual RS-AIPW strategy," which consists of a random sampling (RS) rule that tracks a target allocation ratio and a recommendation rule based on the augmented inverse probability weighting (AIPW) estimator. Our proposed Contextual RS-AIPW strategy is asymptotically optimal: the upper bound on the probability of misidentification matches the semiparametric lower bound as the budget goes to infinity and the gaps converge to zero.
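To make the two components of the strategy concrete, the following is a minimal, non-contextual sketch of an RS-AIPW-style procedure: arms are drawn according to a target allocation ratio (the RS rule), and the recommended arm maximizes an AIPW estimate of each arm's mean reward. All names, the uniform allocation ratio, and the toy reward model are illustrative assumptions, not the paper's actual algorithm, which additionally conditions on contexts and tracks an estimated optimal allocation.

```python
import numpy as np

def rs_aipw_sketch(T=2000, K=3, seed=0):
    """Hypothetical sketch of an RS-AIPW-style fixed-budget strategy.

    Sampling rule: draw arm a_t ~ w, a target allocation ratio
    (fixed and uniform here; the paper's strategy tracks an estimated
    optimal ratio and conditions on observed contexts).
    Recommendation rule: recommend the arm maximizing the AIPW estimate
    of its marginal expected reward.
    """
    rng = np.random.default_rng(seed)
    w = np.full(K, 1.0 / K)                   # target allocation ratio (assumed uniform)
    true_means = np.array([0.5, 0.45, 0.4])   # toy reward means, illustrative only

    sums = np.zeros(K)      # running reward sums per arm
    counts = np.zeros(K)    # pull counts per arm
    aipw = np.zeros(K)      # accumulated AIPW scores per arm

    for t in range(T):
        a = rng.choice(K, p=w)                  # random sampling (RS) rule
        y = true_means[a] + rng.normal(0, 0.1)  # observed reward
        # Outcome-regression estimates mu_hat[k] (plain sample means here;
        # the contextual version fits regressions on past context/reward pairs).
        mu_hat = np.divide(sums, np.maximum(counts, 1))
        # AIPW score for arm k: mu_hat[k] + 1{a_t = k} / w[k] * (y - mu_hat[k]).
        score = mu_hat.copy()
        score[a] += (y - mu_hat[a]) / w[a]
        aipw += score
        sums[a] += y
        counts[a] += 1

    # Recommend the arm with the largest AIPW estimate of its mean reward.
    return int(np.argmax(aipw / T))
```

The inverse-probability correction term makes each round's score an unbiased estimate of the arm's mean regardless of the quality of `mu_hat`, while the regression term reduces variance; this double-robustness is what drives the matching upper bound in the paper.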