Provably Efficient Model-free RL in Leader-Follower MDP with Linear Function Approximation

We consider a multi-agent episodic MDP setup in which one agent (the leader) takes an action at each step of the episode, followed by another agent (the follower). The state evolution and the rewards depend on the joint action pair of the leader and the follower. Such interactions arise in many domains, including smart grids, mechanism design, security, and policymaking. We study how to learn policies for both players with provable performance guarantees under a bandit feedback setting. We focus on a setup where both the leader and the follower are {\em non-myopic}, i.e., each seeks to maximize its reward over the entire episode, and we consider a linear MDP, which can model the continuous state spaces common in RL applications. We propose a {\em model-free} RL algorithm and show that, under bandit feedback, it achieves $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret bounds for both the leader and the follower, where $d$ is the dimension of the feature mapping, $H$ is the length of the episode, and $T$ is the total number of steps. Because the bound does not depend on the number of states, our result holds even when the state space is infinite. The algorithm relies on a {\em novel} adaptation of the LSVI-UCB algorithm. Specifically, we replace the standard greedy policy (as the best response) with a soft-max policy for both the leader and the follower. This turns out to be key in establishing a uniform concentration bound for the value functions. To the best of our knowledge, this is the first sub-linear regret guarantee for Markov games with non-myopic followers under function approximation.
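The two ingredients named above, an optimistic least-squares Q estimate in the style of LSVI-UCB and a soft-max policy in place of the greedy one, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: the function names, the `temperature` parameter, the bonus scale `beta`, and the clipping level `H` are all hypothetical choices, and the abstract does not specify how the soft-max is parameterized.

```python
import numpy as np


def softmax_policy(q_values, temperature=1.0):
    """Action probabilities proportional to exp(q / temperature).

    `temperature` is an assumed parameter; small values approach the
    greedy policy, large values approach the uniform policy.
    """
    q = np.asarray(q_values, dtype=float)
    z = (q - q.max()) / temperature  # subtract the max for numerical stability
    weights = np.exp(z)
    return weights / weights.sum()


def optimistic_q(phi_hist, targets, phi_query, lam=1.0, beta=1.0, H=1.0):
    """Ridge-regression Q estimate plus an elliptical bonus, clipped at H.

    phi_hist:  (n, d) features of past (state, joint action) pairs
    targets:   (n,)   regression targets (reward plus next-step value)
    phi_query: (m, d) features of the candidate actions to score
    """
    phi_hist = np.asarray(phi_hist, dtype=float)
    targets = np.asarray(targets, dtype=float)
    phi_query = np.asarray(phi_query, dtype=float)
    d = phi_hist.shape[1]

    # Regularized Gram matrix and least-squares weight vector.
    gram = lam * np.eye(d) + phi_hist.T @ phi_hist
    w = np.linalg.solve(gram, phi_hist.T @ targets)

    # Bonus beta * sqrt(phi^T gram^{-1} phi) for each query feature.
    gram_inv_phi = np.linalg.solve(gram, phi_query.T)  # shape (d, m)
    bonus = beta * np.sqrt(np.sum(phi_query.T * gram_inv_phi, axis=0))

    return np.minimum(phi_query @ w + bonus, H)


# Toy usage: score two candidate actions, then sample from the soft-max.
rng = np.random.default_rng(0)
q = optimistic_q(np.eye(2), [1.0, 0.0], np.eye(2), H=10.0)
probs = softmax_policy(q, temperature=0.5)
action = rng.choice(len(probs), p=probs)
```

In a leader-follower setting one would presumably apply `softmax_policy` twice per step: once to the leader's Q-values, and once to the follower's Q-values conditioned on the leader's sampled action. Unlike the argmax, the soft-max changes smoothly with the Q estimates, which is the kind of property a uniform concentration argument can exploit.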
