In nonstationary bandit learning problems, the decision-maker must continually gather information and adapt their action selection as the latent state of the environment evolves. In each time period, some latent optimal action maximizes expected reward under the current environment state. We view the sequence of optimal actions as a stochastic process and take an information-theoretic approach to analyzing attainable performance. We bound the limiting per-period regret in terms of the entropy rate of the optimal action process. The bound applies to a wide array of problems studied in the literature and reflects the problem's information structure through its information ratio.
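To make the notion of an entropy rate concrete, here is a minimal sketch under an assumption that is ours rather than the paper's: the optimal action evolves as a stationary, irreducible, aperiodic Markov chain with a known transition matrix. In that special case the entropy rate has the standard closed form $-\sum_i \mu_i \sum_j P_{ij} \log P_{ij}$, where $\mu$ is the stationary distribution. The function names and the two-action example are hypothetical illustrations.

```python
import math

def stationary_distribution(P, iters=10_000):
    """Approximate the stationary distribution of P by power iteration.

    Assumes P is a row-stochastic matrix of an irreducible, aperiodic chain.
    """
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(iters):
        mu = [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]
    return mu

def entropy_rate(P):
    """Entropy rate (in nats) of a stationary Markov chain with transition matrix P."""
    mu = stationary_distribution(P)
    n = len(P)
    return -sum(
        mu[i] * P[i][j] * math.log(P[i][j])
        for i in range(n)
        for j in range(n)
        if P[i][j] > 0  # treat 0 * log(0) as 0
    )

# Hypothetical example: two actions; the optimal one switches with probability p per period.
p = 0.05
P = [[1 - p, p], [p, 1 - p]]
rate = entropy_rate(P)
print(f"entropy rate: {rate:.4f} nats per period")
```

In this toy chain the entropy rate equals the binary entropy of the switching probability, so it shrinks as the optimal action changes less often. That matches the abstract's qualitative message: the less unpredictable the optimal action process, the smaller the bound on limiting per-period regret.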