In real-world machine learning applications, test data may contain meaningful new categories that were not seen in the labeled training data. To simultaneously recognize new categories and assign the most appropriate labels to data that actually belong to known categories, existing models assume that the number of unknown new categories is specified in advance, even though this number is difficult to determine beforehand. In this paper, we propose a Bayesian nonparametric topic model that infers this number automatically, based on the hierarchical Dirichlet process and the notion of latent Dirichlet allocation. Exact inference in our model is intractable, so we provide an efficient collapsed Gibbs sampling algorithm for approximate posterior inference. Extensive experiments on various text data sets show that: (a) compared with parametric approaches that are given the true number of new categories, the proposed nonparametric approach yields comparable performance; and (b) when the exact number of new categories is unavailable, i.e., when the parametric approaches have only a rough estimate of it, our approach has a clear performance advantage.
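To illustrate why a nonparametric prior does not need the number of categories to be fixed in advance, the following is a minimal sketch of the Chinese restaurant process, the standard clustering view of the Dirichlet process on which the hierarchical Dirichlet process builds. It is not the model or the collapsed Gibbs sampler proposed here; the function name `sample_crp_assignments` and the parameter `concentration` are hypothetical names chosen for this illustration.

```python
import random


def sample_crp_assignments(num_items, concentration, seed=0):
    """Draw cluster assignments from a Chinese restaurant process prior.

    Illustrative only: this samples from the prior and ignores any data
    likelihood. `concentration` must be positive.
    """
    rng = random.Random(seed)
    assignments = []    # cluster index chosen for each item, in order
    cluster_sizes = []  # cluster_sizes[k] = number of items in cluster k

    for _ in range(num_items):
        # An existing cluster k is chosen with weight equal to its size;
        # a brand-new cluster is opened with weight `concentration`.
        weights = cluster_sizes + [concentration]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(cluster_sizes):
            cluster_sizes.append(1)  # open a new cluster
        else:
            cluster_sizes[k] += 1    # join an existing cluster
        assignments.append(k)

    return assignments, cluster_sizes


if __name__ == "__main__":
    labels, sizes = sample_crp_assignments(num_items=200, concentration=1.0)
    print("number of clusters:", len(sizes))
    print("cluster sizes:", sizes)
```

The number of clusters is an outcome of the sampling process rather than an input: larger values of `concentration` tend to open more clusters, and the count can keep growing as more items arrive. A full treatment along the lines described above would combine such a prior with a topic-model likelihood and resample assignments given the observed words.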