Statistical inference in contextual bandits is complicated by the adaptive, non-i.i.d. nature of the data. A growing body of work has shown that classical least-squares inference may fail under adaptive sampling, and that constructing valid confidence intervals for linear functionals of the model parameter typically requires paying an unavoidable inflation of order $\sqrt{d \log T}$. This phenomenon -- often referred to as the price of adaptivity -- highlights the inherent difficulty of reliable inference under general contextual bandit policies. A key structural property that circumvents this limitation is the \emph{stability} condition of Lai and Wei, which requires the empirical feature covariance to concentrate around a deterministic limit. When stability holds, the ordinary least-squares estimator satisfies a central limit theorem, and classical Wald-type confidence intervals -- designed for i.i.d. data -- become asymptotically valid even under adaptation, \emph{without} incurring the $\sqrt{d \log T}$ price of adaptivity. In this paper, we propose and analyze a penalized EXP4 algorithm for linear contextual bandits. Our first main result shows that this procedure satisfies the Lai--Wei stability condition and therefore admits valid Wald-type confidence intervals for linear functionals. Our second result establishes that the same algorithm achieves regret guarantees that are minimax optimal up to logarithmic factors, demonstrating that stability and statistical efficiency can coexist within a single contextual bandit method. Finally, we complement our theory with simulations illustrating the empirical normality of the resulting estimators and the sharpness of the corresponding confidence intervals.
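For concreteness, the stability condition and the resulting Wald-type interval can be sketched as follows. The notation is an assumption made here for illustration and is not taken from the paper: $x_t \in \mathbb{R}^d$ denotes the feature vector of the action played at round $t$, $S_T$ the unnormalized empirical feature covariance, $\hat\theta_T$ the ordinary least-squares estimator of the parameter $\theta^\star$, $\hat\sigma_T$ a consistent estimate of the noise standard deviation, and $a \in \mathbb{R}^d$ a fixed direction defining the linear functional $a^\top \theta^\star$.

```latex
% Requires amsmath for \xrightarrow. All symbols are illustrative assumptions.
% Stability in the sense of Lai and Wei: there exist deterministic
% positive definite matrices B_T such that
\[
  B_T^{-1} S_T \xrightarrow{\;p\;} I_d ,
  \qquad
  S_T = \sum_{t=1}^{T} x_t x_t^\top .
\]
% Under stability (and standard moment conditions on the noise),
% the studentized OLS error of a linear functional is asymptotically normal:
\[
  \frac{a^\top (\hat\theta_T - \theta^\star)}
       {\hat\sigma_T \sqrt{a^\top S_T^{-1} a}}
  \xrightarrow{\;d\;} \mathcal{N}(0, 1) .
\]
% The corresponding level-(1 - \alpha) Wald-type interval, with
% z_{1-\alpha/2} the standard normal quantile:
\[
  a^\top \hat\theta_T \;\pm\; z_{1-\alpha/2}\, \hat\sigma_T
  \sqrt{a^\top S_T^{-1} a} .
\]
```

This interval has the same form as in the i.i.d. setting. Intervals that remain valid under arbitrary adaptive policies would instead be wider by a factor of order $\sqrt{d \log T}$.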