Novelty detection methods are trained on a fully labeled training set, whose observations are grouped into a known number of classes, and then classify the instances of an unlabeled test set while allowing for previously unseen classes. These models are valuable in many areas, ranging from social network and food adulteration analyses to biology, where an evolving population may be present. In this paper, we focus on a two-stage Bayesian semiparametric novelty detector, known as Brand, recently introduced in the literature. Leveraging a model-based mixture representation, Brand clusters the test observations into known training terms or a single novelty term. The novelty term is in turn modeled with a Dirichlet Process mixture model, so that it can flexibly capture any departure from the known patterns. Brand was originally estimated using MCMC schemes, which are prohibitively costly when applied to high-dimensional data. To scale Brand's applicability to large datasets, we propose a variational Bayes approach that provides an efficient algorithm for posterior approximation. Thorough simulation studies demonstrate a significant gain in efficiency together with excellent classification performance. Finally, to showcase its applicability, we perform a novelty detection analysis on the openly available Statlog dataset, a large collection of satellite imaging spectra, to search for novel soil types.