ClimaText: A Dataset for Climate Change Topic Detection

Climate change communication in the mass media and other textual sources may affect and shape public perception. Extracting climate change information from these sources is an important task, e.g., for content filtering, e-discovery, sentiment analysis, automatic summarization, question answering, and fact-checking. However, automating this process is a challenge, as climate change is a complex, fast-moving, and often ambiguous topic with scarce resources for popular text-based AI tasks. In this paper, we introduce \textsc{ClimaText}, a dataset for sentence-based climate change topic detection, which we make publicly available. We explore different approaches to identifying the climate change topic in various text sources. We find that popular keyword-based models are not adequate for such a complex and evolving task. Context-based algorithms like BERT \cite{devlin2018bert} can detect, in addition to many trivial cases, a variety of complex and implicit topic patterns. Nevertheless, our analysis reveals considerable room for improvement in several directions, such as capturing the discussion of indirect effects of climate change. Hence, we hope this work can serve as a good starting point for further research on this topic.
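
To illustrate why keyword matching can fall short on this task, here is a minimal sketch of a keyword-based sentence detector. It is not the baseline used in the paper: the keyword list, the function name `keyword_detect`, and both example sentences are assumptions made up for this illustration.

```python
import re

# Hypothetical keyword list, chosen only for illustration (not taken from the paper).
CLIMATE_KEYWORDS = (
    "climate change",
    "global warming",
    "greenhouse gas",
    "carbon emissions",
)


def keyword_detect(sentence: str, keywords=CLIMATE_KEYWORDS) -> bool:
    """Return True if any keyword occurs in the sentence as a whole phrase.

    Matching is case-insensitive and anchored on word boundaries.
    """
    text = sentence.lower()
    return any(
        re.search(r"\b" + re.escape(keyword) + r"\b", text) is not None
        for keyword in keywords
    )


# An explicit mention is caught by the keyword list.
explicit = "Global warming is accelerating the loss of Arctic sea ice."
# An implicit mention (an indirect effect) contains none of the keywords.
implicit = "Coastal cities are raising sea walls as storm surges grow stronger each decade."

print(keyword_detect(explicit))
print(keyword_detect(implicit))
```

The second sentence is arguably about an indirect effect of climate change, yet it shares no phrase with the keyword list, so a detector of this kind would miss it. This is the sort of implicit case for which the abstract suggests context-based models are better suited.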
