Contextual bandit algorithms are essential for solving many real-world interactive machine learning problems. Despite multiple recent successes in developing statistically and computationally efficient methods, the practical behavior of these algorithms is still poorly understood. We leverage the availability of large numbers of supervised learning datasets to empirically evaluate contextual bandit algorithms, focusing on practical methods that learn by relying on optimization oracles from supervised learning. We find that a recent method (Foster et al., 2018) using optimism under uncertainty works best overall. A surprisingly close second is a simple greedy baseline that explores only implicitly through the diversity of contexts, followed by a variant of Online Cover (Agarwal et al., 2014), which tends to be more conservative but is robust to problem specification by design. Along the way, we also evaluate several components of contextual bandit algorithm design, such as loss estimators. Overall, this is a thorough study and review of contextual bandit methodology.