Low-quality data can cause downstream problems in high-stakes applications. The data-centric approach emphasizes improving dataset quality to enhance model performance. High-quality datasets are needed for training general-purpose Large Language Models (LLMs) as well as domain-specific models. Domain-specific datasets are usually small because it is costly to engage a large number of domain experts to create them, so it is vital to ensure that this training data is of high quality. In this paper, we propose a framework for enhancing the data quality of original datasets. We applied the proposed framework to four biomedical datasets and showed relative improvements of up to 33% and 40% for fine-tuning retrieval and reader models, respectively, on the BioASQ dataset when using back translation to enhance the original dataset quality.