Advances in neural dialogue generation models have shown promising results in modeling short-text conversations. However, training such models usually requires a large-scale, high-quality dialogue corpus, which is hard to obtain. In this paper, we present LCCC, a large-scale cleaned Chinese conversation dataset that contains a base version (6.8 million dialogues) and a large version (12.0 million dialogues). The quality of our dataset is ensured by a rigorous data cleaning pipeline built on a set of rules and a classifier trained on 110K manually annotated dialogue pairs. We also release pre-trained dialogue models trained on LCCC-base and LCCC-large, respectively. The cleaned dataset and the pre-trained models will facilitate research on short-text conversation modeling. All models and datasets are available at https://github.com/thu-coai/CDial-GPT.
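
The cleaning pipeline described above combines hand-written rules with a classifier trained on manually annotated dialogue pairs. The following is a minimal sketch of how such a two-stage filter could be organized; the specific rules, thresholds, blacklist, and the classifier interface are illustrative assumptions and are not taken from the paper or its released code.

    import re

    # Hypothetical rule parameters for the sketch, not the ones used to build LCCC.
    BLACKLIST = {"ad_word_1", "ad_word_2"}   # placeholder dirty-word list
    MIN_LEN, MAX_LEN = 2, 128                # placeholder utterance-length bounds

    def passes_rules(post: str, response: str) -> bool:
        """First stage: surface rules applied to a (post, response) pair."""
        for utt in (post, response):
            if not MIN_LEN <= len(utt) <= MAX_LEN:
                return False
            if any(word in utt for word in BLACKLIST):
                return False
            if re.search(r"(http|www\.)", utt):      # drop pairs containing URLs
                return False
        if response.strip() == post.strip():          # drop copy-paste responses
            return False
        return True

    def clean(pairs, classifier, threshold=0.5):
        """Second stage: keep rule-passing pairs that the classifier,
        trained on annotated clean/noisy pairs, scores above a threshold.
        `classifier.predict_proba` is an assumed interface."""
        rule_passed = [p for p in pairs if passes_rules(*p)]
        scores = classifier.predict_proba(rule_passed)
        return [p for p, s in zip(rule_passed, scores) if s >= threshold]

In this arrangement the cheap rule filters remove obviously noisy pairs first, so the learned classifier only has to score the remaining candidates.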