How do you incentivize self-interested agents to $\textit{explore}$ when they prefer to $\textit{exploit}$? We consider complex exploration problems, where each agent faces the same (but unknown) MDP. In contrast with traditional formulations of reinforcement learning, agents control the choice of policies, whereas an algorithm can only issue recommendations. However, the algorithm controls the flow of information, and can incentivize the agents to explore via information asymmetry. We design an algorithm which explores all reachable states in the MDP. We achieve provable guarantees similar to those for incentivizing exploration in static, stateless exploration problems studied previously. To the best of our knowledge, this is the first work to consider mechanism design in a stateful, reinforcement learning setting.