We consider a special case of bandit problems, termed batched bandits, in which an agent observes batches of responses over a certain time period. Unlike previous work, we study the practically relevant, batch-centric scenario of batch learning. That is, we provide a policy-agnostic regret analysis and establish upper and lower bounds on the regret of a candidate policy. Our main theoretical results show that the impact of batch learning can be measured in proportion to the regret of online behavior. We primarily study two settings of the problem: instance-independent and instance-dependent. While the upper bound is the same in both settings, the worst-case lower bound is more comprehensive in the former case and tighter in the latter. We also provide a more robust result for the 2-armed bandit problem as an important insight. Finally, we demonstrate the consistency of the theoretical results through empirical experiments and reflect on the choice of optimal batch size.
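To make the batched setting concrete, the sketch below simulates a 2-armed bandit in which reward feedback is revealed only at batch boundaries (batch size 1 recovers the fully online case). This is an illustrative simulation only: the agent here uses epsilon-greedy for simplicity, whereas the abstract's analysis is policy-agnostic; the function name, parameters, and epsilon value are assumptions, not the paper's method.

```python
import random


def run_batched_bandit(means, horizon, batch_size, seed=0):
    """Simulate a 2-armed epsilon-greedy agent whose reward feedback
    arrives only at the end of each batch of `batch_size` pulls.
    Returns the cumulative (pseudo-)regret over `horizon` rounds.

    Illustrative sketch only: the paper's analysis is policy-agnostic
    and not tied to epsilon-greedy.
    """
    rng = random.Random(seed)
    counts = [0, 0]        # observed pulls per arm (updated at batch ends)
    sums = [0.0, 0.0]      # observed reward sums per arm
    pending = []           # (arm, reward) pairs awaiting batch reveal
    regret = 0.0
    best = max(means)
    eps = 0.1              # assumed exploration rate
    for _ in range(horizon):
        # Explore with probability eps, or if an arm has no observations yet.
        if rng.random() < eps or counts[0] == 0 or counts[1] == 0:
            arm = rng.randrange(2)
        else:
            arm = 0 if sums[0] / counts[0] >= sums[1] / counts[1] else 1
        reward = 1.0 if rng.random() < means[arm] else 0.0
        pending.append((arm, reward))
        regret += best - means[arm]
        # Feedback is revealed only once a full batch has been collected.
        if len(pending) == batch_size:
            for a, r in pending:
                counts[a] += 1
                sums[a] += r
            pending = []
    return regret


if __name__ == "__main__":
    online = run_batched_bandit([0.6, 0.4], horizon=5000, batch_size=1)
    batched = run_batched_bandit([0.6, 0.4], horizon=5000, batch_size=100)
    print("online regret:", online, "batched regret:", batched)
```

Comparing regret across batch sizes in such a simulation is one way to probe the abstract's claim that the cost of batching scales with the regret of the underlying online behavior.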