The multi-armed bandit problem provides a fundamental framework for analyzing the tension between exploration and exploitation in sequential learning. This paper explores Information-Directed Sampling (IDS) policies, a class of heuristics that balance immediate regret against information gain. We focus on the tractable setting of two-state Bernoulli bandits as a minimal model in which heuristic strategies can be compared rigorously against the optimal policy. We extend the IDS framework to the discounted infinite-horizon setting by introducing a modified information measure and a tuning parameter that modulates the decision-making behavior. We examine two specific problem classes: symmetric bandits and the one-fair-coin scenario. In the symmetric case we show that IDS achieves bounded cumulative regret, whereas in the one-fair-coin scenario the IDS policy yields regret that scales logarithmically with the horizon, in agreement with classical asymptotic lower bounds. This work serves as a pedagogical synthesis, aiming to bridge concepts from reinforcement learning and information theory for an audience of statistical physicists.
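As background for readers unfamiliar with IDS, the regret-versus-information trade-off mentioned above is commonly formalized through the information ratio of Russo and Van Roy; the sketch below states that standard (undiscounted) criterion under assumed notation ($\Delta_t$, $g_t$), and the modified, discounted measure introduced in this paper is defined in the main text rather than here.
\[
  \pi_t^{\mathrm{IDS}} \in \operatorname*{arg\,min}_{\pi}\, \frac{\Delta_t(\pi)^{2}}{g_t(\pi)},
\]
where $\Delta_t(\pi)$ denotes the expected one-step regret of acting according to $\pi$ and $g_t(\pi)$ the expected information gain about the identity of the optimal arm (e.g., the mutual information between the optimal arm and the next observation).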