Contextual bandits have been widely used for sequential decision-making based on current contextual information and historical feedback data. In modern applications, the context can be rich and is often naturally formulated as a matrix. Moreover, while existing bandit algorithms have mainly focused on reward maximization, less attention has been paid to statistical inference. To fill these gaps, in this work we consider a matrix contextual bandit framework in which the true model parameter is a low-rank matrix, and we propose a fully online procedure that simultaneously makes sequential decisions and conducts statistical inference. The low-rank structure of the model parameter and the adaptive nature of the data collection process make this difficult: standard low-rank estimators are not fully online and are biased, while existing inference approaches for bandit algorithms fail to account for the low-rankness and are also biased. To address these challenges, we introduce a new online doubly-debiasing inference procedure that handles both sources of bias simultaneously. In theory, we establish the asymptotic normality of the proposed online doubly-debiased estimator and prove the validity of the constructed confidence interval. Our inference results are built upon a newly developed low-rank stochastic gradient descent estimator and its non-asymptotic convergence result, which is also of independent interest.