An NLP approach to quantify dynamic salience of predefined topics in a text corpus

From arXiv. This paper was presented at the 2021 International Conference on Social Computing, Behavioral-Cultural Modeling & Prediction and Behavior Representation in Modeling and Simulation (SBP-BRiMS), 9 July 2021.

The proliferation of online news media presents both a valuable resource and a significant challenge to analysts aiming to profile and understand social and cultural trends in a geographic location of interest. While an abundance of news reports documenting significant events, trends, and responses provides a more democratized picture of the social characteristics of a location, making sense of an entire corpus to extract significant trends is a steep challenge for any one analyst or team. Here, we present an approach using natural language processing techniques that seeks to quantify how a set of predefined topics of interest changes over time across a large corpus of text. We found that, given a predefined topic, we can identify and rank sets of terms, or n-grams, that map to that topic and have usage patterns that deviate from a normal baseline. Emergence, disappearance, or significant variation in n-gram usage presents a ground-up picture of a topic's dynamic salience within a corpus of interest.
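The abstract does not spell out how deviation from a baseline is measured, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes the corpus is already split into time periods, that a topic is given as a hand-picked list of n-grams, and that "deviation" is a z-score of the latest period's relative frequency against earlier baseline periods. The function names (`ngrams`, `relative_frequencies`, `deviation_scores`) and the toy data are hypothetical.

```python
from collections import Counter
from statistics import mean, stdev


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def relative_frequencies(docs_by_period, n=1):
    """For each time period, map every n-gram to its share of all n-grams."""
    freqs = []
    for docs in docs_by_period:
        counts = Counter()
        for doc in docs:
            # Assumption: whitespace tokenization of lowercased text is enough
            # for illustration; a real pipeline would use a proper tokenizer.
            counts.update(ngrams(doc.lower().split(), n))
        total = sum(counts.values()) or 1
        freqs.append({gram: c / total for gram, c in counts.items()})
    return freqs


def deviation_scores(freqs, topic_terms, baseline_periods):
    """Rank topic terms by |z-score| of the last period against the baseline.

    Assumption: the first `baseline_periods` periods (at least 2) define the
    "normal" usage level; the last period is the one being examined.
    """
    scores = {}
    for term in topic_terms:
        base = [f.get(term, 0.0) for f in freqs[:baseline_periods]]
        mu, sigma = mean(base), stdev(base)
        latest = freqs[-1].get(term, 0.0)
        # Small epsilon avoids division by zero for a perfectly flat baseline.
        scores[term] = (latest - mu) / (sigma + 1e-9)
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Toy corpus: three baseline periods followed by one period to examine.
periods = [
    ["flood warning issued", "market prices stable"],
    ["market prices stable", "school reopens"],
    ["market prices rise", "school reopens today"],
    ["flood damages homes", "flood relief arrives", "market prices stable"],
]
freqs = relative_frequencies(periods, n=1)
ranked = deviation_scores(freqs, ["flood", "market"], baseline_periods=3)
for term, z in ranked:
    print(f"{term}: z = {z:+.2f}")
```

A positive score suggests a term is used more than its baseline (emergence), a negative score suggests it is receding. With only a few baseline periods the standard deviation is unstable, so a real analysis would need a longer baseline or a more robust statistic.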
