This paper introduces the informational multi-armed bandit (IMAB) model, in which at each round a player chooses an arm, observes a symbol, and receives an unobserved reward in the form of the symbol's self-information. Thus, the expected reward of an arm is the Shannon entropy of the probability mass function of the source that generates its symbols. The player aims to maximize the expected total reward associated with the entropy values of the arms played. Under the assumption that the alphabet size is known, two UCB-based algorithms are proposed for the IMAB model, both of which account for the bias of the plug-in entropy estimator. The first algorithm optimistically corrects the bias term in the entropy estimate. The second algorithm relies on data-dependent confidence intervals that adapt to sources with small entropy values. Performance guarantees are provided by upper-bounding the expected regret of each algorithm. Furthermore, in the Bernoulli case, the asymptotic behavior of these algorithms is compared to the Lai-Robbins lower bound for the pseudo-regret. Additionally, under the assumption that the \textit{exact} alphabet size is unknown and the player only knows a loose upper bound on it, a UCB-based algorithm is proposed that aims to reduce the regret caused by the unknown alphabet size in a finite-time regime. Numerical results illustrating the expected regret of the algorithms presented in the paper are provided.
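To make the first algorithm's idea concrete, the following is a minimal sketch, not the paper's exact construction: it uses the plug-in entropy estimate, adds the standard first-order (Miller-Madow-style) bias term $(K-1)/(2n)$ as an optimistic correction, and uses a generic Hoeffding-type confidence radius scaled by the $\log K$ range of the entropy. The class name, the radius constant, and the exploration schedule are illustrative assumptions, not the paper's definitions.

```python
import numpy as np
from collections import Counter

def plugin_entropy(counts, n):
    """Plug-in (maximum-likelihood) entropy estimate in nats."""
    p = counts / n
    p = p[p > 0]
    return -np.sum(p * np.log(p))

class BiasCorrectedUCB:
    """Hypothetical sketch of a bias-corrected UCB index for the IMAB model.

    The (K - 1) / (2 n) term is the standard first-order bias of the
    plug-in estimator; the confidence radius is a generic Hoeffding-style
    term over the [0, log K] entropy range, not the paper's exact interval.
    """
    def __init__(self, num_arms, alphabet_size):
        self.K = alphabet_size
        self.counts = [Counter() for _ in range(num_arms)]
        self.pulls = np.zeros(num_arms, dtype=int)

    def select_arm(self, t):
        ucb = np.full(len(self.pulls), np.inf)  # unplayed arms first
        for a, n in enumerate(self.pulls):
            if n == 0:
                continue
            c = np.array(list(self.counts[a].values()), dtype=float)
            h_hat = plugin_entropy(c, n)
            bias = (self.K - 1) / (2 * n)                    # optimistic bias correction
            radius = np.log(self.K) * np.sqrt(2 * np.log(t) / n)
            ucb[a] = h_hat + bias + radius
        return int(np.argmax(ucb))

    def update(self, arm, symbol):
        self.counts[arm][symbol] += 1
        self.pulls[arm] += 1
```

In use, the player calls `select_arm(t)` at round `t`, observes a symbol from the chosen source, and feeds it back via `update`; the self-information reward itself never needs to be observed, since the index is built entirely from the empirical symbol counts.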