Structural invariants and semantic fingerprints in the "ego network" of words

From arXiv. This work was partially funded by the H2020 SoBigData++ (Grant No 871042), H2020 HumaneAI-Net (Grant No 952026), and CHIST-ERA SAI (Grant No not yet available) projects. arXiv admin note: text overlap with arXiv:2110.06015

Well-established cognitive models from anthropology have shown that, due to the cognitive constraints that limit our "bandwidth" for social interactions, humans organise their social relations according to a regular structure. In this work, we postulate that similar regularities can be found in other cognitive processes, such as those involving language production. To investigate this claim, we analyse a dataset containing tweets from a heterogeneous group of Twitter users (regular users and professional writers). Leveraging a methodology similar to the one used to uncover the well-established social cognitive constraints, we find regularities at both the structural and the semantic level. At the structural level, we find that a concentric layered structure (which we call the ego network of words, in analogy to the ego network of social relationships) captures very well how individuals organise the words they use. The size of the layers in this structure grows regularly when moving outwards (each layer is approximately 2-3 times the size of the previous one), and the two penultimate external layers consistently account for approximately 60% and 30% of the used words, irrespective of the user's total number of layers. For the semantic analysis, each ring of each ego network is described by a semantic profile, which captures the topics associated with the words in the ring. We find that ring #1, the innermost ring, has a special role in the model: it is semantically the most dissimilar and the most diverse among the rings. We also show that the topics that are important in the innermost ring are also predominant in each of the other rings, as well as in the entire ego network. In this respect, ring #1 can be seen as the semantic fingerprint of the ego network of words.
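
To make the structural quantities concrete, the following is a minimal sketch of how ring sizes, layer sizes, and the scaling ratio between consecutive layers could be computed once each word has been assigned to a ring. It is not the authors' pipeline. It assumes that rings are disjoint groups of words (ring 1 being the innermost) and that layer i contains all words in rings 1 to i, following the usual ego network convention. The function names and the toy ring assignment are hypothetical and chosen only for illustration.

```python
from collections import Counter


def ring_sizes(ring_of_word):
    """Count the distinct words in each ring (ring 1 = innermost).

    `ring_of_word` maps each word to its ring index; how that index is
    obtained is outside the scope of this sketch (assumption).
    """
    counts = Counter(ring_of_word.values())
    outermost = max(counts, default=0)
    return [counts.get(r, 0) for r in range(1, outermost + 1)]


def layer_sizes(rings):
    """Cumulative sizes: layer i holds every word in rings 1..i (assumption)."""
    sizes, total = [], 0
    for n in rings:
        total += n
        sizes.append(total)
    return sizes


def scaling_ratios(layers):
    """Size of each layer divided by the size of the previous layer."""
    return [outer / inner for inner, outer in zip(layers, layers[1:])]


# Hypothetical toy ego network: 2 words in ring 1, 4 in ring 2, 12 in ring 3.
toy_assignment = {}
for ring, n_words in [(1, 2), (2, 4), (3, 12)]:
    for k in range(n_words):
        toy_assignment[f"word_r{ring}_{k}"] = ring

toy_rings = ring_sizes(toy_assignment)
toy_layers = layer_sizes(toy_rings)
toy_ratios = scaling_ratios(toy_layers)
```

In this toy case each layer is three times the size of the previous one, which falls within the approximate 2-3 range reported above. The numbers are invented and carry no empirical weight.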
