Word embeddings provide an unsupervised way to understand differences in word usage between discursive communities. A number of recent papers have focused on identifying words that are used differently by two or more communities. But word embeddings are complex, high-dimensional spaces, and a focus on identifying differences captures only a fraction of their richness. Here, we take a step towards leveraging the richness of the full embedding space by using word embeddings to map out how words are used differently. Specifically, we describe the construction of dialectograms, an unsupervised way to visually explore the characteristic ways in which each community uses a focal word. Based on these dialectograms, we provide a new measure of the degree to which words are used differently that overcomes the tendency of existing measures to pick out low-frequency or polysemous words. We apply our methods to explore the discourses of two US political subreddits and show how they identify stark affective polarisation of politicians and political entities, differences in the assessment of proper political action, and disagreement about whether certain issues require political intervention at all.