The article introduces corrections to Zipf's and Heaps' laws based on systematic models of the hapax rate, i.e., the proportion of words that occur exactly once. The derivation rests on two assumptions. The first is the standard urn model, which predicts that the marginal frequency distributions of shorter texts look as if their word tokens were sampled blindly from a given longer text. The second posits that the hapax rate is a simple function of the text length. Four such functions are discussed: the constant model, the Davis model, the linear model, and the logistic model. It is shown that the logistic model yields the best fit.
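The urn assumption can be illustrated with a minimal sketch. Under blind sampling without replacement, the expected numbers of hapaxes and of distinct word types in an n-token sample follow from hypergeometric probabilities. The function name `expected_hapaxes_and_types` and the toy text are hypothetical, and taking the hapax rate as hapaxes divided by types is an assumption about the normalization, not something the abstract states.

```python
from collections import Counter
from math import comb


def expected_hapaxes_and_types(tokens, n):
    """Expected hapax and type counts in an n-token sample (urn model sketch).

    Assumes the n tokens are drawn without replacement from `tokens`,
    so each type's count in the sample is hypergeometric.
    """
    total = len(tokens)
    if not 1 <= n <= total:
        raise ValueError("n must satisfy 1 <= n <= len(tokens)")
    denom = comb(total, n)
    hapaxes = 0.0
    types = 0.0
    for f in Counter(tokens).values():
        # P(type with frequency f appears exactly once in the sample)
        hapaxes += f * comb(total - f, n - 1) / denom
        # P(type with frequency f appears at least once in the sample)
        types += 1 - comb(total - f, n) / denom
    return hapaxes, types


# Toy example: hapax rate taken as a ratio of expectations (an assumption).
text = "a b a c".split()
h, v = expected_hapaxes_and_types(text, 2)
rate = h / v
```

Evaluating the function over a range of n gives the kind of length-dependent hapax rate curve to which candidate functions of text length could be compared.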