Although current state-of-the-art language models have achieved impressive results on numerous natural language processing tasks, they still struggle with producing repetitive, dull, and sometimes inconsistent text in open-ended text generation. Studies often attribute this problem to the maximum likelihood training objective and propose alternatives, either by using stochastic decoding methods or by altering the training objective. However, there is still a lack of consistent evaluation metrics for directly comparing the efficacy of these solutions. In this work, we study the evaluation metrics that have been proposed to assess the quality, diversity, and consistency of machine-generated text. Building on these, we propose a practical pipeline for evaluating language models on the open-ended generation task, and investigate how to improve a model's performance along all of these dimensions by leveraging different auxiliary training objectives.