Scientific literature serves as a high-quality corpus, supporting much Natural Language Processing (NLP) research. However, existing datasets are centered on English, which restricts the development of Chinese scientific NLP. In this work, we present CSL, a large-scale Chinese Scientific Literature dataset, which contains the titles, abstracts, keywords, and academic fields of 396k papers. To our knowledge, CSL is the first scientific document dataset in Chinese. CSL can serve as a Chinese corpus; moreover, its semi-structured fields act as natural annotations that can support many supervised NLP tasks. Based on CSL, we present a benchmark to evaluate the performance of models on scientific-domain tasks, i.e., summarization, keyword generation, and text classification. We analyze the behavior of existing text-to-text models on these evaluation tasks and reveal the challenges of Chinese scientific NLP, providing a valuable reference for future research. Data and code are available at https://github.com/ydli-ai/CSL.
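To illustrate how the semi-structured fields of a CSL record can act as natural annotations for the three benchmark tasks, the following is a minimal sketch. It is not the authors' released loader; the field names ("title", "abstract", "keywords", "discipline") and the record contents are assumptions for illustration only, and the actual schema is available at https://github.com/ydli-ai/CSL.

```python
# Hypothetical CSL record; field names and values are illustrative assumptions.
record = {
    "title": "基于深度学习的中文文本分类研究",
    "abstract": "本文提出一种基于预训练语言模型的中文文本分类方法。",
    "keywords": ["深度学习", "文本分类", "预训练模型"],
    "discipline": "计算机科学",
}

# Summarization: generate the title from the abstract.
summarization_example = {"input": record["abstract"], "target": record["title"]}

# Keyword generation: generate the keyword list from the abstract.
keyword_example = {"input": record["abstract"], "target": ",".join(record["keywords"])}

# Text classification: predict the academic field from the abstract.
classification_example = {"input": record["abstract"], "label": record["discipline"]}

print(summarization_example, keyword_example, classification_example, sep="\n")
```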