In this paper, we release the largest medical Question Answering (QA) dataset to date, containing 26 million QA pairs. We benchmark many existing approaches on our dataset in terms of both retrieval and generation. Experimental results show that existing models perform far below expectations and that the released dataset remains challenging in the era of pre-trained language models. Moreover, we experimentally demonstrate the benefits of the proposed dataset in several respects: (i) training models for other QA datasets in a zero-shot fashion; (ii) serving as external knowledge for retrieval-augmented generation (RAG); and (iii) improving existing pre-trained language models by using the QA pairs as a pre-training corpus in a continued-training manner. We believe that this dataset will not only contribute to medical research but also benefit both patients and clinicians. See \url{https://github.com/FreedomIntelligence/Huatuo-26M}.
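To illustrate point (ii), the sketch below shows one minimal way a QA corpus can serve as external knowledge for RAG: retrieve the stored question most similar to a user query and prepend its answer as context for a generative model. The QA pairs, function names, and bag-of-words retriever here are illustrative assumptions, not the paper's actual pipeline or the released dataset's contents.

```python
import math
from collections import Counter

# Hypothetical stand-ins for entries in a large medical QA corpus.
QA_PAIRS = [
    ("What are common symptoms of influenza?",
     "Fever, cough, sore throat, and muscle aches are typical."),
    ("How is type 2 diabetes usually managed?",
     "Lifestyle changes and medications such as metformin."),
]

def bow(text):
    """Lowercased bag-of-words vector as a Counter."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, pairs):
    """Return the answer whose stored question best matches the query."""
    q = bow(query)
    return max(pairs, key=lambda p: cosine(q, bow(p[0])))[1]

def build_prompt(query, pairs):
    """Prepend the retrieved answer as context for a generator."""
    return (f"Context: {retrieve(query, pairs)}\n"
            f"Question: {query}\nAnswer:")
```

A real system would replace the toy retriever with a dense or sparse index over all 26 million pairs, but the control flow (retrieve, then condition generation on the result) is the same.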