We consider a special case of the bandit problem, namely batched bandits. Motivated by natural restrictions of recommender systems and e-commerce platforms, we assume that a learning agent observes responses batched in groups over a certain time period. Unlike previous work, we consider a more practically relevant, batch-centric formulation of batch learning. We provide a policy-agnostic regret analysis and establish upper and lower bounds on the regret of a candidate policy. Our main theoretical results show that the impact of batch learning can be measured in terms of online behavior. Finally, we confirm that the theoretical results are consistent with empirical experiments and reflect on the choice of optimal batch size.