Learning from Incomplete Data by Simultaneous Training of Neural Networks and Sparse Coding

Correctly handling incomplete datasets is a fundamental and classical challenge in machine learning. This paper addresses the problem of training a classifier on a dataset with missing features and applying it to a complete or incomplete test dataset. A supervised learning method is developed to train a general classifier, such as a logistic regression or a deep neural network, using only a limited number of features per sample, under the assumption that the data vectors admit sparse representations on an unknown dictionary. The pattern of missing features may differ for each input data instance and can be either random or structured. The proposed method simultaneously learns the classifier, the dictionary, and the corresponding sparse representation of each input data sample. A theoretical analysis is provided, comparing this method with the standard imputation approach, which consists of performing data completion followed by training the classifier on the resulting reconstructions. Sufficient conditions are identified such that, if it is possible to train a classifier on incomplete observations so that their reconstructions are well separated by a hyperplane, then the same classifier also correctly separates the original (unobserved) data samples. Extensive simulation results on synthetic and well-known reference datasets are presented that validate our theoretical findings and demonstrate the effectiveness of the proposed method compared to traditional data imputation approaches and one state-of-the-art algorithm.
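The abstract does not state the training objective, so the following is only a minimal sketch of what a joint objective of this kind could look like. It assumes a logistic-regression classifier applied to the reconstructions, an l1 penalty to promote sparsity, and a squared reconstruction error restricted to the observed entries. The function name `joint_loss`, the weights `lam` and `gamma`, and all array names are hypothetical and not taken from the paper.

```python
import numpy as np

def joint_loss(X, M, y, D, S, w, b, lam=0.1, gamma=1.0):
    """Illustrative joint objective (an assumption, not the paper's formulation).

    X: (n, d) data; entries where M is False are unobserved and ignored
    M: (n, d) boolean mask, True = observed (may differ per sample)
    y: (n,) labels in {0, 1}
    D: (d, k) dictionary, S: (n, k) sparse codes
    w: (d,), b: scalar, logistic-regression parameters
    """
    M = M.astype(bool)
    X_hat = S @ D.T  # reconstructions, one row per sample
    # Squared error on observed entries only; unobserved entries contribute 0.
    residual = np.where(M, X - X_hat, 0.0)
    recon = np.sum(residual ** 2)
    # l1 penalty encouraging sparse codes.
    sparsity = lam * np.sum(np.abs(S))
    # Logistic loss on the reconstructions, in a numerically stable form:
    # log(1 + exp(z)) - y * z.
    logits = X_hat @ w + b
    clf = np.sum(np.logaddexp(0.0, logits) - y * logits)
    return recon + sparsity + gamma * clf

# Small synthetic example: data that are exactly sparse on a random dictionary.
rng = np.random.default_rng(0)
n, d, k = 20, 8, 5
D = rng.normal(size=(d, k))
S = rng.normal(size=(n, k)) * (rng.random((n, k)) < 0.4)
X = S @ D.T
y = (rng.random(n) < 0.5).astype(float)
M = rng.random((n, d)) < 0.6  # a different random missing pattern per sample
w = rng.normal(size=d)
b = 0.0

loss = joint_loss(X, M, y, D, S, w, b)
```

In a scheme like this, all of `D`, `S`, `w`, and `b` would be updated to reduce the same loss, in contrast to the imputation baseline, which first completes the data and only then fits the classifier. Because the reconstruction term is masked, the values stored at unobserved positions of `X` never influence the objective.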
