Natural language processing (NLP) is a promising approach for analyzing large volumes of scientific literature on climate change and infrastructure. However, state-of-the-practice NLP techniques require a large collection of relevant documents (a corpus). Furthermore, NLP techniques based on machine learning and deep learning require labels, which group articles according to user-defined criteria, for a significant subset of the corpus in order to train a supervised model. Labeling even a few hundred documents with human subject-matter experts is time-consuming. To expedite this process, we developed a weak supervision-based NLP approach that leverages semantic similarity between categories and documents to (i) establish a topic-specific corpus by subsetting a large-scale open-access corpus and (ii) generate category labels for the topic-specific corpus. Whereas labeling by subject-matter experts takes months, our combination of weak supervision and supervised learning assigns category labels to the whole corpus in about 13 hours. The labeled climate and NCF corpus enables targeted, efficient identification of documents that discuss a topic (or combination of topics) of interest, as well as identification of the various effects of climate change on critical infrastructure. This improves the usability of the scientific literature and ultimately supports better policy and decision making. To demonstrate this capability, we conduct topic modeling on pairs of climate hazards and NCFs to discover trending topics at the intersection of these categories. The method helps analysts and decision-makers quickly grasp the relevant topics and the most important documents linked to each topic.
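The similarity-based weak labeling described above can be illustrated with a minimal sketch. The abstract does not say which embedding model, similarity measure, or threshold is used, so everything here is an assumption made for illustration: a bag-of-words count vector stands in for a semantic embedding, cosine similarity stands in for the similarity measure, and the names `embed`, `cosine`, `weak_label`, the `threshold` value, and the example categories are all hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a semantic embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def weak_label(documents, categories, threshold=0.2):
    """Subset the corpus and assign weak labels in one pass.

    A document is kept only if its best category similarity reaches the
    (hypothetical) threshold; the best-scoring category becomes its label.
    """
    category_vectors = {name: embed(desc) for name, desc in categories.items()}
    labeled = []
    for doc in documents:
        doc_vector = embed(doc)
        scores = {
            name: cosine(doc_vector, vec)
            for name, vec in category_vectors.items()
        }
        best = max(scores, key=scores.get)
        if scores[best] >= threshold:
            labeled.append((doc, best, scores[best]))
    return labeled


# Hypothetical category descriptions and documents, for illustration only.
categories = {
    "flood": "flood flooding storm surge inundation",
    "wildfire": "wildfire fire burn smoke",
}
documents = [
    "Storm surge flooding damages coastal power substations",
    "Wildfire smoke reduces solar power output",
    "A new recipe for sourdough bread",
]

for doc, label, score in weak_label(documents, categories):
    print(f"{label}: {doc} ({score:.2f})")
```

In this sketch the unrelated third document scores zero against both categories and is dropped, which mirrors step (i), subsetting the corpus, while the retained documents receive the weak labels of step (ii). In a pipeline like the one the abstract describes, such weak labels would then serve as training data for a supervised classifier; that step is not shown here.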