We consider a general multi-armed bandit problem with correlated (and simple contextual and restless) elements, as a relaxed control problem. By introducing an entropy premium, we obtain a smooth asymptotic approximation to the value function. This yields a novel semi-index approximation of the optimal decision process, obtained numerically by solving a fixed point problem, which can be interpreted as explicitly balancing an exploration-exploitation trade-off. Performance of the resulting Asymptotic Randomised Control (ARC) algorithm compares favourably with other approaches to correlated multi-armed bandits.
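As a generic illustration of why an entropy term smooths the value of a choice among arms, the sketch below uses the standard identity that maximising expected reward plus `lam` times Shannon entropy over randomised choices gives a log-sum-exp value and a softmax policy. It is not the ARC algorithm or its fixed-point problem; the names `smooth_max`, `softmax_policy`, and `lam` are hypothetical and chosen only for this example.

```python
import math


def smooth_max(values, lam):
    """Entropy-regularised maximum of `values`.

    Equals max over probability vectors p of sum_i p_i * v_i + lam * H(p),
    where H is Shannon entropy; the closed form is lam * log(sum(exp(v_i / lam))).
    The largest value is subtracted first for numerical stability.
    """
    m = max(values)
    return m + lam * math.log(sum(math.exp((v - m) / lam) for v in values))


def softmax_policy(values, lam):
    """Randomised choice attaining the entropy-regularised maximum."""
    m = max(values)
    weights = [math.exp((v - m) / lam) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


def entropy(p):
    """Shannon entropy of a probability vector (0 * log 0 treated as 0)."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


# Illustrative arm values (assumed numbers, not from the paper).
arm_values = [1.0, 2.0, 0.5]
lam = 0.5
policy = softmax_policy(arm_values, lam)
regularised_value = sum(q * v for q, v in zip(policy, arm_values)) + lam * entropy(policy)
```

With a small `lam` the policy concentrates on the best arm and `smooth_max` approaches the ordinary maximum; a larger `lam` spreads probability across arms, which is one simple way to read the exploration-exploitation balance mentioned above.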