For researchers leveraging Large Language Models (LLMs) to generate training datasets, especially for conversational recommender systems, the absence of robust evaluation frameworks has been a long-standing problem. The efficiency that LLMs bring to the data generation phase is offset during evaluation of the generated data, which typically requires human raters to verify that the data is of high quality and sufficiently diverse. Because the quality of training data is critical for downstream applications, it is important to develop metrics that evaluate quality holistically and identify biases. In this paper, we present a framework that takes a multi-faceted approach to evaluating datasets produced by generative models, and we discuss the advantages and limitations of various evaluation methods.