We investigate the extent to which offline demonstration data can improve online learning. It is natural to expect some improvement, but the question is how, and by how much? We show that the degree of improvement must depend on the quality of the demonstration data. To generate portable insights, we focus on Thompson sampling (TS) applied to a multi-armed bandit as a prototypical online learning algorithm and model. The demonstration data is generated by an expert with a given competence level, a notion we introduce. We propose an informed TS algorithm that utilizes the demonstration data in a coherent way through Bayes' rule and derive a prior-dependent Bayesian regret bound. This offers insight into how pretraining can greatly improve online performance and how the degree of improvement increases with the expert's competence level. We also develop a practical, approximate informed TS algorithm through Bayesian bootstrapping and show substantial empirical regret reduction through experiments.
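To make the setup concrete, the following is a minimal sketch of how an informed TS algorithm can fold demonstration data into the prior through Bayes' rule. It is a toy instantiation, not the paper's exact algorithm: the finite candidate set of bandit instances, the softmax expert model, and the `competence` parameter are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Hypothetical setup (illustrative, not the paper's construction) ---
# A finite set of candidate Bernoulli bandit instances; the environment is one of them.
candidates = np.array([
    [0.2, 0.5, 0.8],
    [0.8, 0.5, 0.2],
    [0.5, 0.8, 0.2],
])  # rows: candidate mean-reward vectors
true_idx = 0
n_arms = candidates.shape[1]

# Assumed expert model: arms chosen via a softmax over mean rewards;
# `competence` controls how close the expert is to optimal play.
competence = 5.0
def expert_probs(means, comp):
    w = np.exp(comp * means)
    return w / w.sum()

# Offline demonstration data: arms the expert chose on the true instance.
demo = rng.choice(n_arms, size=20, p=expert_probs(candidates[true_idx], competence))

# --- Informed prior: Bayes' rule over candidate instances given the demos ---
log_post = np.zeros(len(candidates))
for theta, means in enumerate(candidates):
    log_post[theta] += np.log(expert_probs(means, competence)[demo]).sum()
post = np.exp(log_post - log_post.max())
post /= post.sum()

# --- Online phase: Thompson sampling starting from the informed prior ---
T, regret = 500, 0.0
best_mean = candidates[true_idx].max()
for t in range(T):
    theta = rng.choice(len(candidates), p=post)    # sample a candidate instance
    a = int(np.argmax(candidates[theta]))          # act greedily under the sample
    r = rng.random() < candidates[true_idx][a]     # Bernoulli reward from the truth
    # Posterior update with the observed reward (Bayes' rule again).
    lik = np.where(r, candidates[:, a], 1.0 - candidates[:, a])
    post = post * lik
    post /= post.sum()
    regret += best_mean - candidates[true_idx][a]
print(f"cumulative regret over {T} rounds: {regret:.1f}")
```

Note that the same Bayes-rule machinery handles both the offline demonstrations and the online rewards; a higher `competence` concentrates the informed prior on the true instance before play begins, which is the mechanism behind the regret reduction the abstract describes.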