We consider the intrinsic evaluation of neural generative dialog models through the lens of Grice's Maxims of Conversation (1975). Based on the maxim of Quantity (be informative), we propose Relative Utterance Quantity (RUQ) to diagnose the "I don't know" problem, in which a dialog system produces generic responses. The linguistically motivated RUQ diagnostic compares the model score of a generic response to that of the reference response. We find that for reasonable baseline models, "I don't know" is preferred over the reference the majority of the time, but this can be reduced to less than 5% with hyperparameter tuning. RUQ allows for the direct analysis of the "I don't know" problem, which has been addressed but not analyzed by prior work.
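The comparison at the heart of RUQ can be illustrated with a minimal sketch. It assumes a hypothetical `score(context, response)` function that returns a model score where higher means more preferred (for example, a log-probability); the abstract does not specify the exact scoring function or any normalization, so `ruq_preference_rate`, `toy_score`, and the example data are illustrative names and values, not the authors' implementation.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical signature: score(context, response) -> float, higher = preferred.
# A real setup would plug in the dialog model's own score here (assumption).
Scorer = Callable[[str, str], float]


def ruq_preference_rate(
    score: Scorer,
    pairs: Iterable[Tuple[str, str]],
    generic: str = "I don't know",
) -> float:
    """Fraction of (context, reference) pairs where the generic
    response receives a strictly higher score than the reference."""
    total = 0
    preferred = 0
    for context, reference in pairs:
        total += 1
        if score(context, generic) > score(context, reference):
            preferred += 1
    if total == 0:
        raise ValueError("at least one (context, reference) pair is required")
    return preferred / total


def toy_score(context: str, response: str) -> float:
    # Stand-in for a real model: it only penalizes length in words,
    # so a short generic reply beats any longer reference.
    return -float(len(response.split()))


# Made-up examples purely for illustration.
examples = [
    ("Where are you from?", "I grew up in a small town near Lyon."),
    ("What time is it?", "Noon."),
]
rate = ruq_preference_rate(toy_score, examples)
print(f"generic response preferred in {rate:.0%} of examples")
```

With the length-penalizing toy scorer, the generic response wins only against the longer reference. A rate near zero on real data would suggest the model rarely prefers the generic response over the reference.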