In a typical supervised machine learning setting, the predictions on all test instances are based on a common subset of features discovered during model training. However, using a different subset of features that is most informative for each test instance individually may improve not only prediction accuracy but also the overall interpretability of the model. At the same time, feature selection methods for classification are known to be most effective when many features are irrelevant and/or uncorrelated. In fact, feature selection that ignores correlations between features can lead to poor classification performance. In this work, a Bayesian network is utilized to model feature dependencies. Using the dependency network, a new method is proposed that sequentially selects the best feature to evaluate for each test instance individually, and stops the selection process to make a prediction once it determines that no further improvement in classification accuracy can be achieved. The optimum number of features to acquire and the optimum classification strategy are derived for each test instance. The theoretical properties of the optimum solution are analyzed, and a new algorithm is proposed that takes advantage of these properties to implement a robust and scalable solution for high-dimensional settings. The effectiveness, generalizability, and scalability of the proposed method are illustrated on a variety of real-world datasets from diverse application domains.
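The abstract describes sequential, per-instance feature acquisition with a stopping rule. As a minimal illustrative sketch (not the paper's method, which uses a Bayesian network over feature dependencies), the code below greedily acquires, for each test instance, the feature with the largest expected reduction in class-posterior entropy under a simple naive Bayes model over binary features, and stops once no remaining feature offers a positive expected improvement. All function names here (`train_nb`, `classify_sequentially`, etc.) are hypothetical.

```python
# Illustrative sketch: greedy per-instance feature acquisition with a
# stopping rule, using a Laplace-smoothed naive Bayes model over binary
# features. This is a simplified stand-in for the dependency-aware
# selection described in the abstract.
from math import log2

def train_nb(X, y, alpha=1.0):
    """Estimate class priors and per-class feature likelihoods P(x_j=1 | y=c)."""
    classes = sorted(set(y))
    n, d = len(y), len(X[0])
    prior = {c: (sum(1 for t in y if t == c) + alpha) / (n + alpha * len(classes))
             for c in classes}
    like = {}
    for c in classes:
        rows = [x for x, t in zip(X, y) if t == c]
        like[c] = [(sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
                   for j in range(d)]
    return classes, prior, like

def posterior(obs, classes, prior, like):
    """P(y | observed features); obs maps feature index -> observed 0/1 value."""
    scores = {}
    for c in classes:
        p = prior[c]
        for j, v in obs.items():
            p *= like[c][j] if v == 1 else 1.0 - like[c][j]
        scores[c] = p
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

def entropy(dist):
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def classify_sequentially(x, model):
    """Acquire the feature with the largest expected entropy drop; stop and
    predict once no unobserved feature yields a positive expected gain."""
    classes, prior, like = model
    d = len(like[classes[0]])
    obs = {}
    while len(obs) < d:
        post = posterior(obs, classes, prior, like)
        h = entropy(post)
        best_j, best_gain = None, 0.0
        for j in range(d):
            if j in obs:
                continue
            # Expected posterior entropy after observing feature j,
            # averaged over its predicted value distribution.
            exp_h = 0.0
            for v in (0, 1):
                pv = sum(post[c] * (like[c][j] if v else 1.0 - like[c][j])
                         for c in classes)
                if pv > 0:
                    exp_h += pv * entropy(
                        posterior({**obs, j: v}, classes, prior, like))
            gain = h - exp_h
            if gain > best_gain + 1e-12:
                best_j, best_gain = j, gain
        if best_j is None:       # stopping rule: no expected improvement left
            break
        obs[best_j] = x[best_j]  # "acquire" the feature's value for this instance
    post = posterior(obs, classes, prior, like)
    return max(post, key=post.get), sorted(obs)
```

On a toy dataset where feature 0 predicts the label and feature 1 is pure noise, the procedure acquires only feature 0 and then stops, illustrating how different test instances can be classified with different (and fewer) features.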