We study best-arm identification with a fixed budget and contextual (covariate) information in stochastic multi-armed bandit problems. In each round, after observing contextual information, we choose a treatment arm using past observations and the current context. Our goal is to identify the best treatment arm, the arm whose expected reward, marginalized over the contextual distribution, is maximal, with a minimal probability of misidentification. First, we derive semiparametric lower bounds on the misidentification probability for this problem, where we regard the gaps between the expected rewards of the best and suboptimal treatment arms as parameters of interest, and all other parameters, such as the expected rewards conditioned on contexts, as nuisance parameters. We then develop the ``Contextual RS-AIPW strategy,'' which consists of a random sampling (RS) rule that tracks a target allocation ratio and a recommendation rule based on the augmented inverse probability weighting (AIPW) estimator. Our proposed Contextual RS-AIPW strategy is asymptotically optimal: the upper bound on its probability of misidentification matches the semiparametric lower bound as the budget goes to infinity and the gaps converge to zero.
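The AIPW estimator mentioned above can be illustrated with a minimal sketch. For a single arm $a$, it combines a plug-in regression estimate of the conditional mean reward with an inverse-probability-weighted correction, so the estimate remains unbiased even when the regression is imperfect. Everything below (the function name `aipw_mean`, the simulated data, the oracle regression) is an illustrative assumption, not the paper's implementation:

```python
import numpy as np

def aipw_mean(rewards, pulled, pi, f_hat):
    """AIPW estimate of E[Y(a)] marginalized over contexts.

    rewards : observed rewards Y_t (only used in rounds where arm a was pulled)
    pulled  : indicator 1{A_t = a}
    pi      : probability the sampling rule pulled arm a given context X_t
    f_hat   : regression estimate of E[Y_t | A_t = a, X_t]
    """
    # IPW correction is zero in rounds where arm a was not pulled,
    # so unobserved rewards never enter the estimate.
    correction = pulled / pi * (rewards - f_hat)
    return np.mean(f_hat + correction)

# Toy simulation: one arm with marginal mean reward 1.0 and a linear
# dependence on a scalar context (all values chosen for illustration).
rng = np.random.default_rng(0)
T = 100_000
x = rng.normal(size=T)                    # contexts X_t
pi = np.full(T, 0.5)                      # RS rule: pull arm a w.p. 0.5
pulled = (rng.random(T) < pi).astype(float)
y = 1.0 + 0.3 * x + rng.normal(scale=0.5, size=T)
y = np.where(pulled == 1.0, y, 0.0)       # reward observed only when pulled
f_hat = 1.0 + 0.3 * x                     # oracle regression, for the sketch
est = aipw_mean(y, pulled, pi, f_hat)
print(est)
```

With this many rounds the estimate concentrates near the true marginal mean of 1.0; in the paper's strategy, the gaps between such per-arm estimates drive the final recommendation.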