Statistical learning under distributional drift remains insufficiently characterized: when each observation alters the data-generating law, classical generalization bounds can collapse. We introduce a new statistical primitive, the reproducibility budget $C_T$, which quantifies a system's finite capacity for statistical reproducibility: the extent to which its sampling process can remain governed by a consistent underlying distribution in the presence of both exogenous change and endogenous feedback. Formally, $C_T$ is the cumulative Fisher-Rao path length of the coupled learner-environment evolution, measuring the total distributional motion accumulated during learning. From this construct we derive a drift-feedback generalization bound of order $O(T^{-1/2} + C_T/T)$, together with a matching minimax lower bound showing that this rate is optimal. These results establish a reproducibility speed limit: no algorithm can achieve smaller worst-case generalization error than that imposed by the average Fisher-Rao drift rate $C_T/T$ of the data-generating process. The framework situates exogenous drift, adaptive data analysis, and performative prediction within a common geometric structure, with $C_T$ emerging as the intrinsic quantity measuring distributional motion across these settings.
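As a minimal sketch of the construction (the discretization below is our illustration, not necessarily the paper's exact formalization): writing $P_1, \ldots, P_T$ for the sequence of data-generating laws induced by the coupled learner-environment dynamics, the budget accumulates Fisher-Rao distance along this trajectory,
\[
C_T \;=\; \sum_{t=1}^{T-1} d_{\mathrm{FR}}\!\left(P_t,\, P_{t+1}\right),
\qquad
d_{\mathrm{FR}}(P,Q) \;=\; \inf_{\gamma:\, P \to Q} \int_0^1 \sqrt{\dot\gamma(s)^{\top} I\!\left(\gamma(s)\right)\, \dot\gamma(s)}\; ds,
\]
where $I(\cdot)$ is the Fisher information matrix along a smooth parametric path $\gamma$ connecting the two laws, and in continuous time the sum becomes the Fisher-Rao path length of the trajectory. Under this reading, the $C_T/T$ term in the bound is the average speed of the data-generating process in the Fisher-Rao geometry, which is what makes the "speed limit" interpretation literal.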