A dialogue policy module is an essential component of task-completion dialogue systems. Recently, increasing interest has focused on reinforcement learning (RL)-based dialogue policies. Their favorable performance and sound action decisions rely on accurate estimation of action values. The overestimation problem is a widely known issue in RL: the estimated maximum action value tends to exceed the ground truth, which leads to an unstable learning process and a suboptimal policy. This problem is detrimental to RL-based dialogue policy learning. To mitigate it, this paper proposes a dynamic partial average estimator (DPAV) of the ground-truth maximum action value. DPAV computes a partial average of the predicted maximum and minimum action values, with weights that are dynamically adaptive and problem-dependent. We incorporate DPAV into a deep Q-network as the dialogue policy and show that our method achieves better or comparable results relative to top baselines on three dialogue datasets from different domains, with a lower computational load. In addition, we theoretically prove convergence and derive upper and lower bounds on the bias of DPAV, comparing them with those of other methods.
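To make the partial-average idea concrete, the sketch below shows how a DPAV-style temporal-difference target could be formed from a next-state Q-value vector. It is a minimal illustration, not the paper's implementation: the function name `dpav_target` and the fixed weight `beta` are placeholders, whereas the paper's weight is dynamically adaptive and problem-dependent.

```python
import numpy as np

def dpav_target(reward, next_q_values, gamma=0.99, beta=0.8):
    """Illustrative DPAV-style TD target (hypothetical fixed weighting).

    Partially averages the predicted maximum and minimum action values of
    the next state. In the paper, beta is dynamically adaptive and
    problem-dependent; here it is a fixed placeholder for clarity.
    """
    q_max = np.max(next_q_values)  # optimistic estimate, prone to overestimation
    q_min = np.min(next_q_values)  # pessimistic estimate, prone to underestimation
    partial_avg = beta * q_max + (1.0 - beta) * q_min
    return reward + gamma * partial_avg

# Example: next-state Q-values taken from a deep Q-network's forward pass
target = dpav_target(reward=1.0, next_q_values=np.array([0.2, 0.9, 0.5]))
```

Weighting toward `q_max` keeps the target close to standard Q-learning, while mixing in `q_min` counteracts the upward bias of the max operator; adapting the weight per problem is what distinguishes DPAV from a fixed interpolation.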