In this paper, the problem of training a classifier on a dataset with incomplete features is addressed. We assume that a different subset of features (random or structured) is available for each data instance. This situation typically occurs in applications where not all features are collected for every data sample. A new supervised learning method is developed to train a general classifier, such as a logistic regression or a deep neural network, using only a subset of features per sample, while assuming sparse representations of the data vectors over an unknown dictionary. Sufficient conditions are identified under which, if it is possible to train a classifier on incomplete observations so that their reconstructions are well separated by a hyperplane, then the same classifier also correctly separates the original (unobserved) data samples. Extensive simulation results on synthetic and well-known datasets are presented that validate our theoretical findings and demonstrate the effectiveness of the proposed method compared to traditional data imputation approaches and a state-of-the-art algorithm.
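To make the setting concrete, the following is a minimal sketch (not the authors' implementation) of the pipeline the abstract describes: each incompletely observed sample is sparse-coded on the observed coordinates of a dictionary, reconstructed, and a linear classifier is trained on the reconstructions and then evaluated on the original, fully observed data. For simplicity the dictionary `D`, the missingness rate, and the Lasso-based sparse coder are assumed here; in the paper the dictionary is unknown.

```python
# Hypothetical sketch of classification with missing features via sparse
# reconstruction; D, the mask model, and the Lasso coder are assumptions.
import numpy as np
from sklearn.linear_model import Lasso, LogisticRegression

rng = np.random.default_rng(0)
n, d, k = 400, 60, 120                          # samples, feature dim, dictionary atoms
D = rng.standard_normal((d, k)) / np.sqrt(d)    # assumed (known) dictionary for the sketch

# Ground-truth sparse codes and labels from a separating hyperplane
S = rng.standard_normal((n, k)) * (rng.random((n, k)) < 0.05)
X = S @ D.T                                     # full (unobserved) data
w = rng.standard_normal(d)
y = (X @ w > 0).astype(int)

# Each sample reveals only a random subset of its features
mask = rng.random((n, d)) < 0.5                 # True = observed entry

def reconstruct(x, m, lam=0.01):
    """Sparse-code x on the observed rows of D, then reconstruct all features."""
    lasso = Lasso(alpha=lam, fit_intercept=False, max_iter=5000)
    lasso.fit(D[m], x[m])
    return D @ lasso.coef_

X_hat = np.array([reconstruct(X[i], mask[i]) for i in range(n)])

# Train the classifier on reconstructions of the incomplete observations,
# then check that it also separates the original (fully observed) samples.
clf = LogisticRegression(max_iter=2000).fit(X_hat, y)
print("accuracy on the original data:", clf.score(X, y))
```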