Language modeling, a central task in natural language processing, involves estimating a probability distribution over strings. In most cases, the estimated distribution sums to 1 over all finite strings. However, in some pathological cases, probability mass can ``leak'' onto the set of infinite sequences. In order to characterize the notion of leakage more precisely, this paper offers a measure-theoretic treatment of language modeling. We prove that many popular language model families are in fact tight, meaning that they will not leak in this sense. We also generalize characterizations of tightness proposed in previous works.
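As a toy illustration of leakage (our example, not one taken from the paper): suppose a model over a one-symbol alphabet halts after emitting its $t$-th symbol with probability $p_t$ and otherwise continues. The mass it places on finite strings is $1 - \prod_{t \ge 1}(1 - p_t)$, which falls short of $1$ exactly when $\sum_t p_t < \infty$. For instance, with $p_t = 2^{-t}$,
\[
  P(\text{finite string}) \;=\; 1 - \prod_{t=1}^{\infty}\bigl(1 - 2^{-t}\bigr) \;\approx\; 1 - 0.2888 \;=\; 0.7112,
\]
so roughly $29\%$ of the probability mass ``leaks'' onto the single infinite sequence, and the model is not tight. By contrast, any constant halting probability $p_t = p > 0$ drives the product to $0$, so no mass leaks and the model is tight.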