In the stochastic linear contextual bandit setting there exist several minimax procedures for exploration with policies that are reactive to the data being acquired. In practice, there can be a significant engineering overhead to deploy these algorithms, especially when the dataset is collected in a distributed fashion or when a human in the loop is needed to implement a different policy. Exploring with a single non-reactive policy is beneficial in such cases. Assuming some batch contexts are available, we design a single stochastic policy to collect a good dataset from which a near-optimal policy can be extracted. We present a theoretical analysis as well as numerical experiments on both synthetic and real-world datasets.
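As a minimal illustration of the setting only (not the procedure developed in the paper), the sketch below simulates a stochastic linear contextual bandit, explores it with a single non-reactive stochastic policy fixed before any reward is observed (uniform over actions, a placeholder for the designed exploration distribution), and then extracts a greedy policy from the logged data via ridge regression. All names and parameters here (`theta_star`, `greedy_policy`, the noise level, the uniform exploration choice) are illustrative assumptions, not the paper's method.

```python
# Minimal sketch: non-reactive exploration in a simulated stochastic
# linear contextual bandit, followed by off-policy extraction of a
# greedy policy. Illustrative only; not the paper's designed policy.
import numpy as np

rng = np.random.default_rng(0)
d, K, n = 5, 10, 2000              # feature dim, actions per context, rounds
theta_star = rng.normal(size=d)    # unknown reward parameter (simulation only)

# Batch of contexts assumed available in advance; each context provides
# K action feature vectors in R^d.
contexts = rng.normal(size=(n, K, d))

# Non-reactive exploration: a single stochastic policy fixed up front
# (uniform over actions; the paper instead designs this distribution).
actions = rng.integers(K, size=n)
X = contexts[np.arange(n), actions]             # features of chosen actions
y = X @ theta_star + 0.1 * rng.normal(size=n)   # noisy linear rewards

# Extraction: ridge estimate of the reward parameter, then act greedily.
lam = 1.0
theta_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def greedy_policy(action_features):
    """Pick the action whose features score highest under theta_hat."""
    return int(np.argmax(action_features @ theta_hat))

# Evaluate suboptimality of the extracted policy on fresh contexts.
test = rng.normal(size=(500, K, d))
values_opt = np.max(test @ theta_star, axis=-1)
values_hat = np.array([(c @ theta_star)[greedy_policy(c)] for c in test])
print("average suboptimality:", np.mean(values_opt - values_hat))
```

In this toy run the only design freedom is the fixed action distribution used during logging; the paper's contribution is precisely how to choose that single stochastic policy from the batch contexts so that the logged dataset supports near-optimal policy extraction.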