A key feature of sequential decision making under uncertainty is the need to balance exploiting--choosing the best action according to current knowledge--and exploring--obtaining information about the values of other actions. The multi-armed bandit problem, a classical task that captures this trade-off, has served in machine learning as a vehicle for developing bandit algorithms that have proved useful in numerous industrial applications. The active inference framework, an approach to sequential decision making recently developed in neuroscience for understanding human and animal behaviour, is distinguished by its sophisticated strategy for resolving the exploration-exploitation trade-off. This makes active inference an exciting alternative to already established bandit algorithms. Here we derive an efficient and scalable approximate active inference algorithm and compare it to two state-of-the-art bandit algorithms: Bayesian upper confidence bound and optimistic Thompson sampling. The comparison is done on two types of bandit problems: a stationary bandit and a dynamic switching bandit. Our empirical evaluation shows that the active inference algorithm does not produce efficient long-term behaviour in stationary bandits. However, in the more challenging switching bandit problem, active inference performs substantially better than the two state-of-the-art bandit algorithms. These results open exciting avenues for further research in theoretical and applied machine learning, and lend additional credibility to active inference as a general framework for studying human and animal behaviour.
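To make the baseline concrete, the following is a minimal sketch of optimistic Thompson sampling on a stationary Bernoulli bandit, where each posterior sample is clipped below at its posterior mean. The arm probabilities, horizon, and random seed are illustrative assumptions, not the evaluation setup used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stationary Bernoulli bandit: true reward probabilities per arm.
true_probs = np.array([0.3, 0.5, 0.7])
n_arms = len(true_probs)
horizon = 1000

# Beta(1, 1) priors over each arm's reward probability.
alpha = np.ones(n_arms)
beta = np.ones(n_arms)

for t in range(horizon):
    # Draw one posterior sample per arm.
    samples = rng.beta(alpha, beta)
    # Optimistic variant: never act on a sample below the posterior mean.
    means = alpha / (alpha + beta)
    arm = int(np.argmax(np.maximum(samples, means)))
    # Bernoulli reward from the chosen arm.
    reward = int(rng.random() < true_probs[arm])
    # Conjugate Beta-Bernoulli posterior update.
    alpha[arm] += reward
    beta[arm] += 1 - reward

print("posterior means:", alpha / (alpha + beta))
```

In a stationary bandit the posterior concentrates on the best arm over time; the switching bandit studied in the paper violates this stationarity assumption, which is where the comparison with active inference becomes interesting.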