Evaluating Natural Language Generation (NLG) outputs is crucial but laborious and expensive. While various automatic NLG assessment methods have been proposed, they are often task-specific and have to be engineered with a particular domain and attribute in mind. In this work, we propose a robust zero-shot approach to NLG evaluation using pairwise comparative judgment with open-source Large Language Models (LLMs). The motivation for this approach is that, even for humans, it is easier to determine which of two options is better than it is to score each option independently and objectively. Leveraging this insight and the emergent abilities of LLMs, we probe FlanT5 to determine which of two candidate responses is better, rather than assigning absolute scores. Our results demonstrate that comparative assessment is a more effective approach than absolute scoring, enabling smaller open-source LLMs to achieve performance comparable to larger public-access APIs. We evaluate our approach on both summary evaluation and dialogue response generation, and show that open-source LLMs can yield good correlations with human scores across a range of attributes.
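A minimal sketch of the pairwise comparative setup described above, assuming the Hugging Face `google/flan-t5-large` checkpoint and an illustrative prompt wording; the exact prompt format, attribute name, and single-token "A"/"B" decision rule are assumptions for illustration, not the paper's exact implementation.

```python
# Sketch: zero-shot pairwise comparative assessment with FlanT5.
# Assumptions: checkpoint name, prompt wording, and "A"/"B" scoring are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")
model.eval()

def compare(context: str, cand_a: str, cand_b: str, attribute: str = "coherent") -> str:
    """Return 'A' or 'B' depending on which candidate the model judges better."""
    prompt = (
        f"Passage: {context}\n\n"
        f"Summary A: {cand_a}\n\n"
        f"Summary B: {cand_b}\n\n"
        f"Which summary is more {attribute}, Summary A or Summary B?"
    )
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    with torch.no_grad():
        # Score the first decoder step and compare the logits of the tokens "A" and "B".
        decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
        logits = model(**inputs, decoder_input_ids=decoder_input_ids).logits[0, -1]
    id_a = tokenizer("A", add_special_tokens=False).input_ids[0]
    id_b = tokenizer("B", add_special_tokens=False).input_ids[0]
    return "A" if logits[id_a] > logits[id_b] else "B"
```

To rank several candidate responses, one simple option is to run such pairwise comparisons over candidate pairs and aggregate win counts into a score per candidate.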