Exploration is a crucial aspect of bandit and reinforcement learning algorithms. The uncertainty quantification necessary for exploration often comes from either closed-form expressions based on simple models or resampling and posterior approximations that are computationally intensive. We propose instead an approximate exploration methodology based on fitting only two point estimates, one tuned and one overfit. The approach, which we term the residual overfit method of exploration (ROME), drives exploration towards actions where the overfit model exhibits the most overfitting compared to the tuned model. The intuition is that overfitting occurs the most at actions and contexts with insufficient data to form accurate predictions of the reward. We justify this intuition formally from both a frequentist and a Bayesian information theoretic perspective. The result is a method that generalizes to a wide variety of models and avoids the computational overhead of resampling or posterior approximations. We compare ROME against a set of established contextual bandit methods on three datasets and find it to be one of the best performing.
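To make the idea concrete, here is a minimal, hypothetical sketch rather than the paper's algorithm: it fits a regularized ("tuned") and a nearly unregularized ("overfit") ridge regressor per action, and uses the absolute residual between their predictions as an optimism bonus. Note this is a UCB-style stand-in for ROME's exploration scheme, and all names (ResidualOverfitBandit, alpha_tuned, bonus_scale) are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge

class ResidualOverfitBandit:
    """Illustrative sketch of residual-overfit-driven exploration.

    Per action, a regularized ("tuned") and a nearly unregularized
    ("overfit") model are fit to the same data. Where the two disagree
    most, the data are presumed scarce, so the residual acts as an
    exploration bonus.
    """

    def __init__(self, n_actions, alpha_tuned=1.0, alpha_overfit=1e-6,
                 bonus_scale=1.0):
        self.n_actions = n_actions
        self.bonus_scale = bonus_scale
        self.tuned = [Ridge(alpha=alpha_tuned) for _ in range(n_actions)]
        self.overfit = [Ridge(alpha=alpha_overfit) for _ in range(n_actions)]
        # Per-action buffers of observed (context, reward) pairs.
        self.data = [([], []) for _ in range(n_actions)]

    def select(self, context):
        scores = []
        for a in range(self.n_actions):
            X, y = self.data[a]
            if len(y) < 2:
                # Too little data to fit either model: explore this action.
                return a
            mu_tuned = self.tuned[a].predict([context])[0]
            mu_over = self.overfit[a].predict([context])[0]
            # Residual between overfit and tuned predictions as a bonus.
            bonus = self.bonus_scale * abs(mu_over - mu_tuned)
            scores.append(mu_tuned + bonus)
        return int(np.argmax(scores))

    def update(self, action, context, reward):
        X, y = self.data[action]
        X.append(context)
        y.append(reward)
        if len(y) >= 2:
            self.tuned[action].fit(X, y)
            self.overfit[action].fit(X, y)
```

In this sketch the bonus shrinks as data accumulate for an action, since the overfit and tuned fits converge, which mirrors the abstract's intuition that overfitting is largest where data are insufficient.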