We study a finite-horizon restless multi-armed bandit problem with multiple actions, dubbed R(MA)^2B. The state of each arm evolves according to a controlled Markov decision process (MDP), and the reward of pulling an arm depends on both the current state of the corresponding MDP and the action taken. The goal is to sequentially choose actions for the arms so as to maximize the expected cumulative reward collected. Since finding the optimal policy is typically intractable, we propose a computationally appealing index policy, which we call the Occupancy-Measured-Reward Index Policy. Our policy is well-defined even if the underlying MDPs are not indexable. We prove that it is asymptotically optimal when the activation budget and the number of arms are scaled up while keeping their ratio constant. For the case when the system parameters are unknown, we develop a learning algorithm that uses the principle of optimism in the face of uncertainty and further employs a generative model to fully exploit the structure of the Occupancy-Measured-Reward Index Policy. We call it the R(MA)^2B-UCB algorithm. Compared with existing algorithms, R(MA)^2B-UCB performs close to the offline optimal policy and achieves sublinear regret with low computational complexity. Experimental results show that R(MA)^2B-UCB outperforms existing algorithms in both regret and running time.
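To make the occupancy-measure viewpoint concrete, the following is a minimal sketch of the standard finite-horizon linear-programming relaxation on which such an index policy can be built; it is not necessarily the paper's exact program, and the notation (horizon H, per-arm kernels P_n, rewards r_n, initial distributions \rho_n, budget B, and occupancy measures \mu) is ours for illustration:
\[
\max_{\mu \ge 0}\;\; \sum_{n=1}^{N} \sum_{t=1}^{H} \sum_{s,a} r_n(s,a)\,\mu_n^t(s,a)
\quad \text{s.t.} \quad
\sum_{a} \mu_n^1(s,a) = \rho_n(s) \;\; \forall n, s,
\]
\[
\sum_{a} \mu_n^{t+1}(s',a) = \sum_{s,a} P_n(s' \mid s,a)\,\mu_n^t(s,a) \;\; \forall n, t, s',
\qquad
\sum_{n=1}^{N} \sum_{s} \sum_{a \neq 0} \mu_n^t(s,a) \le B \;\; \forall t,
\]
where \mu_n^t(s,a) denotes the probability that arm n is in state s and takes action a at time t, and the last constraint relaxes the per-step activation budget so that it only needs to hold in expectation. In this style of relaxation, the optimal occupancy measure \mu^* yields occupancy-measured rewards from which one can derive a per-arm priority index.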