To advance Chinese financial natural language processing (NLP), we introduce BBT-FinT5, a new Chinese financial pre-trained language model based on the T5 architecture. To support this effort, we have built BBT-FinCorpus, a large-scale financial corpus of approximately 300GB of raw text drawn from four different sources. In general-domain NLP, comprehensive benchmarks such as GLUE and SuperGLUE have driven significant advances in language model pre-training by enabling head-to-head comparisons among models. Drawing inspiration from these benchmarks, we propose BBT-CFLEB, a Chinese Financial Language understanding and generation Evaluation Benchmark, which includes six datasets covering both understanding and generation tasks. Our aim is to facilitate research on NLP in the Chinese financial domain. Our model, corpus, and benchmark are released at https://github.com/ssymmetry/BBT-FinCUGE-Applications. This work is part of Big Bang Transformer (BBT), a large-scale pre-trained language model project.