Decision-makers often face the "many bandits" problem, where one must simultaneously learn across related but heterogeneous contextual bandit instances. For instance, a large retailer may wish to dynamically learn product demand across many stores to solve pricing or inventory problems, making it desirable to learn jointly for stores serving similar customers; alternatively, a hospital network may wish to dynamically learn patient risk across many providers to allocate personalized interventions, making it desirable to learn jointly for hospitals serving similar patient populations. We study the setting where the unknown parameter in each bandit instance can be decomposed into a global parameter plus a sparse instance-specific term. Then, we propose a novel two-stage estimator that exploits this structure in a sample-efficient way by using a combination of robust statistics (to learn across similar instances) and LASSO regression (to debias the results). We embed this estimator within a bandit algorithm, and prove that it improves asymptotic regret bounds in the context dimension $d$; this improvement is exponential for data-poor instances. We further demonstrate how our results depend on the underlying network structure of bandit instances.
翻译:决策者常常面临“许多土匪”问题,其中一个人必须同时从相关但不同背景的土匪事件中学习。例如,一个大零售商可能希望动态地学习许多商店的产品需求,以解决定价或库存问题,从而使为类似客户服务的商店有必要联合学习;另一个医院网络可能希望动态地学习许多供应商的病人风险,以便分配个性化干预措施,使为类似病人口服务的医院有必要联合学习。我们研究每个土匪中未知的参数可以分解成一个全球参数加上一个稀有实例特定术语的设置。然后,我们提出一个新的两阶段估算器,通过使用强有力的统计数据(在类似实例中学习)和LASSO回归(对结果进行偏差)相结合,以抽样高效的方式利用这一结构。我们把这个估计器嵌入了一个土匪算法,并证明它改善了上下文中的微调的遗憾界限 $dd;这一改进对于数据贫乏的例子来说是指数的指数。我们进一步展示我们的结果如何取决于土匪事件的基本网络结构。