Contextual sequential decision problems with categorical or numerical observations are ubiquitous, and Generalized Linear Bandits (GLB) offer a solid theoretical framework to address them. In contrast to the linear bandit case, existing algorithms for GLB suffer from two drawbacks that undermine their applicability. First, they rely on excessively pessimistic concentration bounds due to the non-linear nature of the model. Second, they require either non-convex projection steps or burn-in phases to enforce boundedness of the estimators. Both issues are worsened in non-stationary models, in which the GLB parameter may vary with time. In this work, we focus on self-concordant GLB (which include logistic and Poisson regression) with forgetting achieved either through a sliding window or exponential weights. We propose a novel confidence-based algorithm for the maximum-likelihood estimator with forgetting and analyze its performance in abruptly changing environments. These results, together with the accompanying numerical simulations, highlight the potential of the proposed approach to address non-stationarity in GLB.
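To make the forgetting mechanism concrete, the following is a minimal NumPy sketch of a maximum-likelihood logistic-regression estimator with exponential-weight forgetting, solved by regularized Newton iterations. This is an illustrative toy, not the paper's algorithm: the names (`weighted_logistic_mle`, `gamma`, `reg`) and the L2 regularizer are our own assumptions, and the confidence-set construction is omitted entirely.

```python
import numpy as np

def weighted_logistic_mle(X, y, gamma=0.97, reg=1.0, n_iter=50):
    """Logistic-regression MLE where observation t (out of T) is
    discounted by gamma**(T-1-t), so recent samples dominate.

    Hypothetical illustration: `reg` is an L2 regularizer that keeps
    the Newton steps well-posed; it stands in for the boundedness
    assumptions discussed in the abstract.
    """
    T, d = X.shape
    w = gamma ** np.arange(T - 1, -1, -1)  # most recent sample gets weight 1
    theta = np.zeros(d)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ theta))      # sigmoid predictions
        grad = X.T @ (w * (p - y)) + reg * theta  # weighted gradient
        # Weighted Hessian of the regularized log-likelihood
        H = (X * (w * p * (1.0 - p))[:, None]).T @ X + reg * np.eye(d)
        theta -= np.linalg.solve(H, grad)         # Newton step
    return theta
```

On synthetic data whose true parameter switches abruptly halfway through the stream, the exponentially weighted estimate tracks the post-change parameter, whereas an unweighted MLE would average the two regimes. A sliding-window variant would instead simply drop all samples older than the window before fitting.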