Online convex optimization (OCO) is a widely used framework in online learning. In each round, the learner chooses a decision in a convex set and an adversary chooses a convex loss function, and then the learner suffers the loss associated with their chosen decision. However, in many of the motivating applications the loss of the learner depends not only on the current decision but also on the entire history of decisions up to that point. The OCO framework and existing generalizations thereof fail to capture this. In this work we introduce a generalization of the OCO framework, ``Online Convex Optimization with Unbounded Memory'', that captures long-term dependence on past decisions. We introduce the notion of $p$-effective memory capacity, $H_p$, which quantifies the maximum influence of past decisions on current losses. We prove an $O(\sqrt{H_1 T})$ policy regret bound and a stronger $O(\sqrt{H_p T})$ policy regret bound under mild additional assumptions. These bounds are optimal in terms of their dependence on the time horizon $T$. We show the broad applicability of our framework by using it to derive regret bounds, and to simplify existing regret bound derivations, for a variety of online learning problems, including an online variant of performative prediction and online linear control.
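To make the round structure above concrete, the following is a minimal sketch in Python of a history-dependent online protocol, under illustrative assumptions that are not taken from the paper: decisions lie in the Euclidean unit ball, each round's loss is a convex function of the entire decision history (a discounted quadratic penalty toward synthetic targets), and the learner simply runs projected online gradient descent with respect to its most recent decision. The policy regret mentioned above compares the learner's cumulative loss against that of the best fixed decision replayed in every round; the comparator used below is only a heuristic stand-in for that minimizer.

```python
import numpy as np

# Illustrative sketch only: a toy OCO-with-memory instance, not the paper's
# algorithm or notation. Decisions live in the Euclidean unit ball, and the
# loss at round t depends on the whole history of decisions.

rng = np.random.default_rng(0)
d, T, eta = 5, 200, 0.1

def project_unit_ball(x):
    """Project onto the Euclidean unit ball (the convex decision set)."""
    norm = np.linalg.norm(x)
    return x if norm <= 1.0 else x / norm

# Adversary's per-round targets, fixed in advance for this toy example.
targets = rng.normal(size=(T, d))

def loss(history, t, rho=0.5):
    """Convex loss at round t that depends on the entire decision history:
    a discounted sum of squared distances of past decisions to targets[t]."""
    weights = rho ** np.arange(len(history) - 1, -1, -1)
    return sum(w * np.sum((x - targets[t]) ** 2) for w, x in zip(weights, history))

x = np.zeros(d)
history = []
learner_loss = 0.0
for t in range(T):
    history.append(x.copy())
    learner_loss += loss(history, t)
    # Gradient of the round-t loss with respect to the current decision only
    # (weight rho**0 = 1), ignoring its effect on future rounds.
    grad = 2.0 * (x - targets[t])
    x = project_unit_ball(x - eta * grad)

def total_loss_of_fixed(u):
    """Cumulative loss of playing the fixed decision u in every round."""
    return sum(loss([u] * (t + 1), t) for t in range(T))

# Heuristic comparator (policy regret would use the true minimizer instead).
best_fixed = project_unit_ball(targets.mean(axis=0))
print("learner loss:", learner_loss)
print("fixed-comparator loss:", total_loss_of_fixed(best_fixed))
```

The discount factor `rho`, the quadratic losses, and the comparator choice are all assumptions made for this sketch; they only illustrate how a history-dependent loss and a fixed-decision benchmark interact in the protocol described above.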