This paper presents a novel approach to the acquisition of language models from corpora. The framework builds on Cobweb, an early system for constructing taxonomic hierarchies of probabilistic concepts that used a tabular, attribute-value encoding of training cases and concepts, making it unsuitable for sequential input like language. In response, we explore three new extensions to Cobweb -- the Word, Leaf, and Path variants. These systems encode each training case as an anchor word and surrounding context words, and they store probabilistic descriptions of concepts as distributions over anchor and context information. As in the original Cobweb, a performance element sorts a new instance downward through the hierarchy and uses the final node to predict missing features. Learning is interleaved with performance, updating concept probabilities and hierarchy structure as classification occurs. Thus, the new approaches process training cases in an incremental, online manner that is very different from most methods for statistical language learning. We examine how well the three variants place synonyms together and keep homonyms apart, their ability to recall synonyms as a function of training set size, and their training efficiency. Finally, we discuss related work on incremental learning and directions for further research.
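To make the instance encoding concrete, the following minimal sketch turns a token sequence into anchor-plus-context training cases. The abstract does not specify the window size, whether the window is symmetric, or how context words are stored, so the `window` parameter, the symmetric window, and the bag-of-counts representation are all assumptions made for illustration; the function name `anchor_context_instances` is hypothetical.

```python
from collections import Counter


def anchor_context_instances(tokens, window=2):
    """Build one (anchor, context_counts) pair per token position.

    Assumptions (not taken from the paper): a symmetric window of
    `window` words on each side, with context stored as word counts.
    """
    instances = []
    for i, anchor in enumerate(tokens):
        left = tokens[max(0, i - window):i]       # words before the anchor
        right = tokens[i + 1:i + 1 + window]      # words after the anchor
        instances.append((anchor, Counter(left + right)))
    return instances


tokens = "the river bank was muddy".split()
instances = anchor_context_instances(tokens, window=2)
for anchor, context in instances:
    print(anchor, dict(context))
```

Each resulting pair could then be presented to the learner one at a time, matching the incremental, online processing the abstract describes; how the hierarchy is updated from such instances is left to the paper itself.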