Conventionally, since the natural language action space is astronomical, approximate dynamic programming applied to dialogue generation relies on policy improvement with action sampling. However, this practice is inefficient for reinforcement learning (RL) because eligible (high action-value) responses are very sparse, and the greedy policy sustained by random sampling is weak. This paper shows, both theoretically and empirically, that the performance of the dialogue policy is positively correlated with the sampling size. To alleviate this limitation, we introduce a novel dual-granularity Q-function that explores the most promising response category to intervene in the sampling. It extracts actions following a granularity hierarchy, which allows the optimum to be reached with fewer policy iterations. Our approach learns via offline RL from multiple reward functions designed to recognize human emotional nuances. Empirical studies demonstrate that our algorithm outperforms the baseline methods, and further verification shows that it generates responses with higher expected rewards and better controllability.
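For concreteness, the sketch below illustrates one plausible reading of the coarse-to-fine selection described above: a coarse-grained Q-function first picks the most promising response category, and a fine-grained Q-function then ranks candidates sampled only from that category, so far fewer samples are needed than with uniform sampling over the whole response space. This is not the paper's implementation; all names (`q_category`, `q_response`, `sample_responses`, `NUM_CATEGORIES`, `SAMPLE_SIZE`) are hypothetical placeholders.

```python
# Minimal sketch (assumptions, not the paper's code) of greedy action
# selection with a dual-granularity Q-function.
import random
from typing import Callable, List

NUM_CATEGORIES = 8   # number of coarse response categories (assumed)
SAMPLE_SIZE = 16     # responses sampled within the chosen category (assumed)

def greedy_response(
    state: str,
    q_category: Callable[[str, int], float],                  # coarse Q: (state, category) -> value
    q_response: Callable[[str, str], float],                  # fine Q: (state, response) -> value
    sample_responses: Callable[[str, int, int], List[str]],   # generator conditioned on a category
) -> str:
    """Choose the most promising category first, then sample and rank
    candidate responses only inside that category."""
    # Coarse step: pick the category with the highest coarse-grained Q-value.
    best_cat = max(range(NUM_CATEGORIES), key=lambda c: q_category(state, c))
    # Fine step: sample candidates from that category and rank them with the
    # fine-grained Q-value.
    candidates = sample_responses(state, best_cat, SAMPLE_SIZE)
    return max(candidates, key=lambda r: q_response(state, r))

# Toy usage with random stand-in scorers, only to show the call pattern.
if __name__ == "__main__":
    rng = random.Random(0)
    reply = greedy_response(
        state="I failed my exam today.",
        q_category=lambda s, c: rng.random(),
        q_response=lambda s, r: rng.random(),
        sample_responses=lambda s, c, n: [f"cat{c}-reply-{i}" for i in range(n)],
    )
    print(reply)
```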