The distribution of sentence length in ordinary language is not well captured by existing models. Here we survey previous models of sentence length and present a random walk model that offers both a better fit to the data and a better understanding of the distribution. We develop a generalization of KL divergence, discuss how to measure the noise inherent in a corpus, and present a hyperparameter-free Bayesian model comparison method that has strong conceptual ties to Minimum Description Length (MDL) modeling. The models we obtain require only a few dozen bits, orders of magnitude fewer than naive nonparametric MDL models would.
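As a rough illustration of the kind of fit measurement mentioned above, the following sketch computes the standard KL divergence, in bits, between an empirical sentence-length distribution and a candidate model. It is not the generalization of KL divergence developed in the paper, and the geometric model is only a hypothetical baseline, not the random walk model. The toy data and all function names are invented for this example.

```python
import math
from collections import Counter


def empirical_distribution(lengths):
    """Relative frequency of each observed sentence length."""
    counts = Counter(lengths)
    total = sum(counts.values())
    return {n: c / total for n, c in counts.items()}


def geometric_pmf(n, p):
    """P(length = n) for n = 1, 2, ... under a geometric model.

    Assumption: used here only as a simple illustrative baseline.
    """
    return p * (1.0 - p) ** (n - 1)


def kl_divergence_bits(empirical, model_pmf):
    """Standard KL divergence D(P || Q) in bits, summed over observed lengths."""
    return sum(
        p * math.log2(p / model_pmf(n))
        for n, p in empirical.items()
        if p > 0
    )


# Toy sentence lengths in words (invented for illustration, not corpus data).
lengths = [3, 5, 5, 8, 12, 12, 12, 20, 21, 30]
P = empirical_distribution(lengths)

# Maximum-likelihood estimate of the geometric parameter on {1, 2, ...}.
mean_len = sum(lengths) / len(lengths)
p_hat = 1.0 / mean_len

kl = kl_divergence_bits(P, lambda n: geometric_pmf(n, p_hat))
print(f"KL(empirical || geometric) = {kl:.3f} bits")
```

A smaller value means the model wastes fewer bits per sentence when encoding the observed lengths, which is the intuition connecting divergence-based fit to description length.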