We model online recommendation systems as a hidden Markov multi-state restless multi-armed bandit problem. To solve it, we present a Monte Carlo rollout policy. We illustrate numerically that the Monte Carlo rollout policy outperforms the myopic policy under arbitrary transition dynamics with no specific structure. However, when structure is imposed on the transition dynamics, the myopic policy performs better than the Monte Carlo rollout policy.
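To make the comparison concrete, the following is a minimal Python sketch of the two policies on a toy restless-bandit instance. It is not the paper's exact model: the two-state chains, the shared transition matrix P, the reveal-on-play observation model, and the horizon and rollout-count parameters are all illustrative assumptions.

```python
# Minimal sketch (illustrative assumptions, not the paper's exact model):
# a 2-state restless bandit where playing an arm reveals its current state.
import numpy as np

rng = np.random.default_rng(0)

N_ARMS = 3                      # number of arms (assumed)
P = np.array([[0.7, 0.3],       # state-transition matrix shared by all arms
              [0.4, 0.6]])      # (assumed; arbitrary dynamics in general)
R = np.array([0.0, 1.0])        # reward of each hidden state when played

def step_beliefs(beliefs, arm, state):
    """Propagate every arm's belief one step; the played arm's belief
    resets to the transition row of its observed state."""
    new = beliefs @ P
    new[arm] = P[state]
    return new

def myopic_arm(beliefs):
    """Myopic policy: pick the arm with the highest expected immediate reward."""
    return int(np.argmax(beliefs @ R))

def rollout_value(beliefs, first_arm, horizon, n_rollouts):
    """Estimate the value of playing `first_arm` now and then following the
    myopic base policy, by averaging simulated H-step returns."""
    total = 0.0
    for _ in range(n_rollouts):
        b = beliefs.copy()
        arm = first_arm
        for _ in range(horizon):
            state = rng.choice(2, p=b[arm])   # sample hidden state of played arm
            total += R[state]
            b = step_beliefs(b, arm, state)
            arm = myopic_arm(b)               # base policy for remaining steps
    return total / n_rollouts

def mc_rollout_arm(beliefs, horizon=10, n_rollouts=100):
    """Monte Carlo rollout policy: one-step lookahead with simulated
    rollouts of the myopic base policy."""
    values = [rollout_value(beliefs, a, horizon, n_rollouts)
              for a in range(N_ARMS)]
    return int(np.argmax(values))

def simulate(policy, T=200, seed=1):
    """Run one T-step trajectory and return the cumulative reward."""
    sim_rng = np.random.default_rng(seed)
    states = sim_rng.choice(2, size=N_ARMS)   # true hidden states
    beliefs = np.full((N_ARMS, 2), 0.5)       # uniform initial beliefs
    reward = 0.0
    for _ in range(T):
        arm = policy(beliefs)
        reward += R[states[arm]]
        beliefs = step_beliefs(beliefs, arm, states[arm])
        # all arms are restless: every hidden state transitions each step
        states = np.array([sim_rng.choice(2, p=P[s]) for s in states])
    return reward

print("myopic :", simulate(myopic_arm))
print("rollout:", simulate(mc_rollout_arm))
```

The rollout policy trades computation for lookahead: each decision costs N_ARMS x horizon x n_rollouts simulated steps, whereas the myopic policy needs only one expected-reward evaluation per arm, which is why imposing exploitable structure on the dynamics can let the cheap myopic policy win.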