How do you incentivize self-interested agents to $\textit{explore}$ when they prefer to $\textit{exploit}$? We consider complex exploration problems, where each agent faces the same (but unknown) MDP. In contrast with traditional formulations of reinforcement learning, agents control the choice of policies, whereas an algorithm can only issue recommendations. However, the algorithm controls the flow of information, and can incentivize the agents to explore via information asymmetry. We design an algorithm which explores all reachable states in the MDP. We achieve provable guarantees similar to those for incentivizing exploration in static, stateless exploration problems studied previously.
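To make the interaction model concrete, here is a minimal Python sketch of the recommendation loop the abstract describes. Everything in it is an illustrative assumption rather than the paper's actual algorithm: the toy chain MDP, the horizon, the exploration rate, and, most importantly, the assumption that every agent complies with its recommendation. In the paper, compliance is not assumed; it is induced by the algorithm's informational advantage over the agents.

```python
import random

N_STATES = 5   # toy deterministic chain MDP: states 0 .. N_STATES-1
HORIZON = 4    # episode length

def step(state: int, action: int) -> int:
    """Toy transition: action 1 advances along the chain, action 0 stays."""
    return min(state + 1, N_STATES - 1) if action == 1 else state

def rollout(policy) -> list[int]:
    """Run one episode from state 0 under a (state -> action) policy."""
    state, visited = 0, [0]
    for _ in range(HORIZON):
        state = step(state, policy(state))
        visited.append(state)
    return visited

def mediator(num_agents: int) -> set[int]:
    """Recommend a policy to each arriving agent and track reached states.

    A small fraction of agents receive an exploratory recommendation; the
    rest receive the (here trivial) exploitative one. Because each agent
    sees only its own recommendation, it cannot tell which kind it got;
    this is the information asymmetry the abstract refers to. Compliance
    is assumed here for simplicity.
    """
    reached = {0}
    for _ in range(num_agents):
        explore = random.random() < 0.3          # illustrative rate, not from the paper
        policy = (lambda s: 1) if explore else (lambda s: 0)
        reached.update(rollout(policy))          # mediator observes the trajectory
    return reached

if __name__ == "__main__":
    print("states reached:", sorted(mediator(num_agents=50)))
```

The point of the sketch is who controls what: agents choose whether to follow a recommended policy, while the mediator controls which recommendations (and which information) each agent receives; the guarantee stated in the abstract is that the set of reached states eventually covers every state reachable in the MDP.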