Language models have steadily increased in size over the past few years, achieving strong performance on a variety of natural language processing (NLP) tasks such as question answering and summarization. Large language models (LLMs) can now generate fluent, human-like text, and downstream tasks in the dialog domain can harness their language understanding capabilities. This paper explores one such task, dialog evaluation, focusing on prompting with LLMs: BLOOM, OPT, GPT-3, Flan-T5, InstructDial, and TNLGv2. The paper shows that the choice of datasets used to train a model affects both how well it performs on the task and how the prompt should be structured. Specifically, the more diverse and relevant the set of datasets a model is trained on, the better it performs at dialog evaluation. The paper also investigates how the number of examples in the prompt and the method of example selection affect the model's performance.