WMDecompose: A Framework for Leveraging the Interpretable Properties of Word Mover's Distance in Sociocultural Analysis

Despite the increasing popularity of NLP in the humanities and social sciences, advances in model performance and complexity have been accompanied by concerns about interpretability and explanatory power for sociocultural analysis. One popular model that balances complexity and legibility is Word Mover's Distance (WMD). Ostensibly adopted for its interpretability, WMD has nonetheless been used and further developed in ways that frequently discard its most interpretable aspect: namely, the word-level distances required for translating a set of words into another set of words. To address this apparent gap, we introduce WMDecompose: a model and Python library that 1) decomposes document-level distances into their constituent word-level distances, and 2) subsequently clusters words to induce thematic elements, such that useful lexical information is retained and summarized for analysis. To illustrate its potential in a social scientific context, we apply it to a longitudinal social media corpus to explore the interrelationship between conspiracy theories and conservative American discourses. Finally, because of the full WMD model's high time-complexity, we additionally suggest a method of sampling document pairs from large datasets in a reproducible way, with tight bounds that prevent extrapolation of unreliable results due to poor sampling practices.
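The decomposition step described in the abstract can be illustrated with a minimal sketch. This is not the WMDecompose library's API: the function name `wmd_decompose`, the toy 2-D word vectors, and the example documents are all hypothetical. The sketch also assumes both documents contain the same number of distinct words with uniform weights, in which case the optimal transport plan is a one-to-one matching that can be brute-forced. A real implementation would solve the general transport problem with an optimal-transport solver. The clustering step is not shown.

```python
from itertools import permutations
from math import dist


def wmd_decompose(words_a, words_b, vectors):
    """Return (total distance, per-word costs for A, per-word costs for B).

    Assumption: both documents hold the same number of distinct words with
    uniform weights, so the optimal flow is a one-to-one matching.
    """
    n = len(words_a)
    assert n == len(words_b), "this sketch only covers equal-length documents"
    weight = 1.0 / n  # each word carries an equal share of the document mass

    # Pairwise Euclidean distances between the word vectors of A and B.
    cost = [[dist(vectors[a], vectors[b]) for b in words_b] for a in words_a]

    # Brute-force the cheapest matching (only feasible for toy-sized inputs).
    best = min(
        permutations(range(n)),
        key=lambda p: sum(cost[i][p[i]] for i in range(n)),
    )

    # Word-level contributions: the mass moved times the distance it travels.
    by_a = {words_a[i]: weight * cost[i][best[i]] for i in range(n)}
    by_b = {words_b[best[i]]: weight * cost[i][best[i]] for i in range(n)}
    return sum(by_a.values()), by_a, by_b


# Hypothetical toy embeddings, invented purely for illustration.
vectors = {
    "obama": (1.0, 0.0),
    "president": (1.0, 0.2),
    "speaks": (0.0, 1.0),
    "greets": (0.1, 1.0),
}

total, by_a, by_b = wmd_decompose(
    ["obama", "speaks"], ["president", "greets"], vectors
)
print(total, by_a, by_b)
```

The document-level distance is the sum of the word-level contributions, so the per-word dictionaries show which words account for most of the distance between the two documents.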

