Existing reference-free metrics have obvious limitations for evaluating controlled text generation models. Unsupervised metrics can only provide task-agnostic evaluation results that correlate weakly with human judgments, whereas supervised ones may overfit task-specific data and generalize poorly to other datasets. In this paper, we propose an unsupervised reference-free metric called CTRLEval, which evaluates controlled text generation from different aspects by formulating each aspect as multiple text infilling tasks. On top of these tasks, the metric assembles the generation probabilities from a pre-trained language model without any model training. Experimental results show that our metric has higher correlations with human judgments than other baselines, while generalizing better to generated texts from different models and of different qualities.
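To make the infilling formulation concrete, below is a minimal sketch of how one might assemble generation probabilities from a pre-trained LM over several infilling tasks. It assumes a T5-style model via HuggingFace transformers purely for illustration; the function name `infill_log_prob` and the example tasks are hypothetical, and this is not the paper's exact CTRLEval procedure or aspect design.

```python
# Minimal sketch: scoring a text via pre-trained-LM text infilling.
# Assumes HuggingFace transformers and a T5-style infilling model;
# an illustrative approximation, not the paper's implementation.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")
model.eval()

def infill_log_prob(context_with_mask: str, target: str) -> float:
    """Average log-probability of `target` filling the masked span."""
    inputs = tokenizer(context_with_mask, return_tensors="pt")
    labels = tokenizer(target, return_tensors="pt").input_ids
    with torch.no_grad():
        # The returned loss is the mean negative log-likelihood
        # of the label tokens, so its negation is a log-prob score.
        loss = model(**inputs, labels=labels).loss
    return -loss.item()

# One evaluation aspect is decomposed into several infilling tasks;
# the per-task scores are then aggregated (here: a simple average).
tasks = [
    ("The movie was <extra_id_0> and I loved it.",
     "<extra_id_0> fantastic"),
    ("<extra_id_0> The acting was superb.",
     "<extra_id_0> The movie was great."),
]
score = sum(infill_log_prob(c, t) for c, t in tasks) / len(tasks)
print(f"aggregated infilling score: {score:.4f}")
```

Because the scores come directly from a frozen pre-trained LM, no metric-specific training is required, which is what allows the approach to remain unsupervised and reference-free.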