In recent years, large pretrained language models (LMs) have revolutionized the field of natural language processing (NLP). However, while pretraining on general language has been shown to work very well for common text, niche language has been observed to pose problems. In particular, climate-related texts use specialized language that common LMs cannot represent accurately. We argue that this shortcoming of today's LMs limits the applicability of modern NLP to the broad field of climate-related text processing. As a remedy, we propose CLIMATEBERT, a transformer-based language model that is further pretrained on over 2 million paragraphs of climate-related text crawled from various sources such as news outlets, research articles, and corporate climate reporting. We find that CLIMATEBERT leads to a 48% improvement on a masked language modeling objective, which in turn lowers error rates by 3.57% to 35.71% on various climate-related downstream tasks such as text classification, sentiment analysis, and fact-checking.
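The domain-adaptive pretraining described above can be illustrated with a minimal sketch of continued masked language model training using the Hugging Face transformers library. The base checkpoint (distilroberta-base), the file name climate_paragraphs.txt, and all hyperparameters are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch: continue pretraining a general-purpose LM on climate-related
# paragraphs with a masked language modeling (MLM) objective.
# Assumptions: a RoBERTa-style base checkpoint and a text file with one
# climate-related paragraph per line (hypothetical path).
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
model = AutoModelForMaskedLM.from_pretrained("distilroberta-base")

# Load the climate paragraphs and tokenize them.
dataset = load_dataset("text", data_files={"train": "climate_paragraphs.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Randomly mask 15% of tokens; the model learns to reconstruct them.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="climatebert-pretraining",
    per_device_train_batch_size=32,
    num_train_epochs=3,
    learning_rate=5e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```

The resulting checkpoint can then be fine-tuned on the downstream tasks mentioned in the abstract (text classification, sentiment analysis, fact-checking) in the standard way, e.g. via AutoModelForSequenceClassification.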