Suppose that the only available information in a multi-class problem consists of expert estimates of the conditional probabilities of occurrence for a set of binary features. The aim is to select a subset of features to be measured in subsequent data collection experiments. In the absence of any information about dependencies between the features, we assume that all features are conditionally independent and hence choose the Naive Bayes classifier as the optimal classifier for the problem. Even in this (seemingly trivial) case of complete knowledge of the distributions, choosing an optimal feature subset is not straightforward. We discuss the properties and implementation details of Sequential Forward Selection (SFS) as a feature selection procedure for this problem. A sensitivity analysis was carried out to investigate whether the same features are selected when the probabilities vary around the estimated values. The procedure is illustrated with a set of probability estimates for Scrapie in sheep.
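Because the class-conditional probabilities are assumed known, the Naive Bayes accuracy of any candidate feature subset can be computed exactly by enumerating the binary feature vectors, and SFS can greedily add the feature that most improves it. The following is a minimal sketch of this idea; the priors, probability matrix, and choice of exact accuracy as the selection criterion are illustrative assumptions, not the paper's actual data or implementation:

```python
from itertools import product

def nb_accuracy(priors, p, subset):
    """Exact accuracy of the Naive Bayes classifier restricted to the
    features in `subset`, computed by enumerating all binary vectors.
    priors[c] = P(class c); p[c][i] = P(feature i = 1 | class c).
    Assumes conditional independence of features given the class."""
    acc = 0.0
    for x in product([0, 1], repeat=len(subset)):
        # joint probability of observing x together with each class
        joint = []
        for c, prior in enumerate(priors):
            q = prior
            for bit, i in zip(x, subset):
                q *= p[c][i] if bit else (1.0 - p[c][i])
            joint.append(q)
        acc += max(joint)  # Bayes rule: assign x to the most probable class
    return acc

def sfs(priors, p, n_select):
    """Sequential Forward Selection: starting from the empty set,
    greedily add the feature that maximizes the exact accuracy."""
    n_features = len(p[0])
    selected = []
    while len(selected) < n_select:
        remaining = [i for i in range(n_features) if i not in selected]
        best = max(remaining,
                   key=lambda i: nb_accuracy(priors, p, selected + [i]))
        selected.append(best)
    return selected
```

For example, with two equiprobable classes and a single discriminative feature (`p = [[0.9, 0.5], [0.1, 0.5]]`), `sfs` selects that feature first, since it alone yields an exact accuracy of 0.9 while the other contributes nothing. Note that the enumeration cost grows as 2^|subset|, which is one reason the implementation details of SFS matter even in this complete-knowledge setting.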