This paper addresses a central theme in applied statistics and information science: the assessment of the stochastic structure of rank-size laws in text analysis. We rank the words in a corpus by frequency in descending order. The starting point is that ranked data generated in linguistic contexts can be viewed as realisations of a discrete-state Markov chain whose stationary distribution follows a discretisation of the best-fitting rank-size law. The methodological toolkit is Markov Chain Monte Carlo, specifically the Metropolis-Hastings algorithm. The theoretical framework is applied to the rank-size analysis of the hapax legomena occurring in the speeches of the US Presidents. We report a large number of statistical tests that support the consistency of our methodological proposal. In pursuing these aims, we also offer arguments supporting the view that hapaxes are rare (``extreme'') events resulting from memoryless-like processes. Moreover, we show that the considered sample has the stochastic structure of a Markov chain of order one. Importantly, we discuss the versatility of the method, which appears suitable for deriving similar results in other applied science contexts.
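As an illustration of the core idea only, and not the authors' implementation, the following minimal Python sketch builds a Metropolis-Hastings chain on a finite set of ranks whose stationary distribution is a discretised rank-size law. Everything specific here is an assumption made for the example: the pure Zipf form with exponent 1 (the abstract does not state which law fits best), the nearest-neighbour proposal with wrap-around, and the hypothetical names `zipf_weights` and `metropolis_hastings_ranks`.

```python
import random


def zipf_weights(n_ranks, exponent=1.0):
    """Unnormalised weights rank**(-exponent) for ranks 1..n_ranks.

    Assumption: a pure Zipf law stands in for the best-fitting rank-size law.
    """
    return [rank ** (-exponent) for rank in range(1, n_ranks + 1)]


def metropolis_hastings_ranks(weights, n_steps, seed=0):
    """Simulate a first-order Markov chain on ranks 1..len(weights).

    The chain's stationary distribution is proportional to `weights`.
    """
    rng = random.Random(seed)
    n = len(weights)
    state = rng.randrange(n)  # zero-based index of the current rank
    chain = []
    for _ in range(n_steps):
        # Symmetric proposal (assumed): one rank up or down, wrapping around.
        proposal = (state + rng.choice((-1, 1))) % n
        # With a symmetric proposal the acceptance ratio reduces to the
        # ratio of target weights; the normalising constant cancels.
        accept_prob = min(1.0, weights[proposal] / weights[state])
        if rng.random() < accept_prob:
            state = proposal
        chain.append(state + 1)  # store the one-based rank
    return chain


if __name__ == "__main__":
    weights = zipf_weights(5)
    chain = metropolis_hastings_ranks(weights, n_steps=200_000)
    total = sum(weights)
    for rank, w in enumerate(weights, start=1):
        empirical = chain.count(rank) / len(chain)
        print(f"rank {rank}: target {w / total:.3f}, empirical {empirical:.3f}")
```

Because each step depends only on the current rank, the simulated sequence is a Markov chain of order one by construction. Comparing the empirical rank frequencies with the normalised weights gives a simple check that the chain settles on the intended rank-size law.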