Motivated by human-machine interactions such as training chatbots to improve customer satisfaction, we study human-guided human-machine interaction involving private information. We model this interaction as a two-player turn-based game, where one player (Alice, a human) guides the other player (Bob, a machine) towards a common goal. Specifically, we focus on offline reinforcement learning (RL) in this game, where the goal is to find a policy pair for Alice and Bob that maximizes their expected total rewards based on an offline dataset collected a priori. The offline setting presents two challenges: (i) we cannot collect Bob's private information, which leads to a confounding bias when standard RL methods are applied, and (ii) there is a distributional mismatch between the behavior policy used to collect the data and the desired policy we aim to learn. To tackle the confounding bias, we treat Bob's previous action as an instrumental variable for Alice's current decision making so as to adjust for the unmeasured confounding. We develop a novel identification result and use it to propose a new off-policy evaluation (OPE) method for evaluating policy pairs in this two-player turn-based game. To tackle the distributional mismatch, we leverage the idea of pessimism and use our OPE method to develop an off-policy learning algorithm for finding a desirable policy pair for both Alice and Bob. Finally, we prove that under mild assumptions such as partial coverage of the offline data, the policy pair obtained through our method converges to the optimal one at a satisfactory rate.
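As a rough illustration only (the notation below is assumed for exposition and is not taken from the paper), the pessimism principle for off-policy learning can be sketched as selecting the policy pair that maximizes the worst-case value estimate over all value functions consistent with the offline data:
$$
(\widehat{\pi}_A, \widehat{\pi}_B) \in \operatorname*{arg\,max}_{(\pi_A,\,\pi_B)} \; \min_{Q \in \mathcal{C}(\mathcal{D})} \widehat{J}(\pi_A, \pi_B; Q),
$$
where $\mathcal{D}$ is the offline dataset, $\mathcal{C}(\mathcal{D})$ is a confidence set of value functions consistent with $\mathcal{D}$, and $\widehat{J}(\pi_A, \pi_B; Q)$ is an OPE estimate of the expected total reward of the policy pair $(\pi_A, \pi_B)$ under $Q$. Taking the minimum over the confidence set penalizes policy pairs whose value rests on state-action regions poorly covered by the data, which is why only partial coverage of the offline data is needed.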