Zipf's law describes an inverse proportionality between a word's rank in a given corpus and its frequency there, roughly dividing the vocabulary into frequent and infrequent words. Here, we posit that within a domain an author's signature can be derived from, in loose terms, the popular words the author omits and the infrequent words the author uses frequently. We devise a method, termed Latent Personal Analysis (LPA), for finding domain-based attributes for entities in a domain: their distance from the domain and their signature, which captures how they differ most from the domain. We identify the most suitable distance metric for the method among several candidates and construct the distances and personal signatures for authors, the domain's entities. The signature consists of both over-used terms (compared to the average) and missing popular terms. We validate the correctness and power of the signatures in identifying users and establish the conditions under which signatures exist. We then show uses for the method in explainable authorship attribution: we define algorithms that use LPA to identify two types of impersonation in social media: (1) authors with sockpuppet (multiple) accounts; (2) front-user accounts operated by several authors. We validate the algorithms and apply them to a large-scale dataset obtained from a social media site with over 4000 users. We corroborate these results using temporal rate analysis. LPA can further be used to derive personal attributes in a wide range of scientific domains in which the constituents have a long-tail distribution of elements.
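To make the notion of a signature concrete, the following is a minimal Python sketch of the idea, not the authors' implementation. The abstract does not name the chosen distance metric, so the sketch assumes a symmetric Kullback-Leibler (Jeffreys) divergence with a small `epsilon` substituted for terms the author never uses. The function names `term_distribution`, `domain_distribution`, and `lpa_signature`, and the toy corpus, are hypothetical.

```python
from collections import Counter
import math


def term_distribution(tokens):
    """Relative frequency of each term in a list of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def domain_distribution(tokens_by_author):
    """Assumption: the domain is the pooled text of all its authors."""
    pooled = [t for tokens in tokens_by_author.values() for t in tokens]
    return term_distribution(pooled)


def lpa_signature(author_dist, domain_dist, size=3, epsilon=1e-6):
    """Return (distance, signature) of an author relative to the domain.

    Each term contributes (p - q) * log(p / q), which is never negative.
    The distance is the sum of the contributions; the signature lists the
    terms with the largest contributions, labeled by direction.
    epsilon stands in for a zero probability (an assumption of this sketch).
    """
    vocab = set(author_dist) | set(domain_dist)
    contribution = {}
    for term in vocab:
        p = author_dist.get(term, epsilon)
        q = domain_dist.get(term, epsilon)
        contribution[term] = (p - q) * math.log(p / q)
    distance = sum(contribution.values())
    # Sort by contribution, breaking ties alphabetically for determinism.
    ranked = sorted(vocab, key=lambda t: (-contribution[t], t))[:size]
    signature = [
        (t, "over-used"
         if author_dist.get(t, 0.0) > domain_dist.get(t, 0.0)
         else "missing")
        for t in ranked
    ]
    return distance, signature


# Toy corpus (made up for illustration).
corpus = {
    "a": ["the", "the", "of", "zebra", "zebra", "zebra"],
    "b": ["the", "of", "of", "and", "the", "and"],
    "c": ["the", "and", "of", "the", "of", "and"],
}
domain = domain_distribution(corpus)
distance_a, signature_a = lpa_signature(
    term_distribution(corpus["a"]), domain, size=2
)
```

In this toy domain, author `a` never uses the popular term `and` and over-uses the otherwise rare `zebra`, so both terms surface in the signature with opposite labels. The choice of `epsilon` strongly affects how heavily missing terms are weighted, which is one reason the choice of metric matters.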