Most real-world problems that machine learning algorithms are expected to solve involve 1) unknown data distributions, 2) little domain-specific knowledge, and 3) datasets with limited annotation. We propose Non-Parametric learning by Compression with Latent Variables (NPC-LV), a learning framework for any dataset with abundant unlabeled data but very few labeled examples. By training only a generative model in an unsupervised way, the framework utilizes the data distribution to build a compressor. Using a compressor-based distance metric derived from Kolmogorov complexity, together with the few labeled examples, NPC-LV classifies without further training. We show that NPC-LV outperforms supervised methods on image classification across all three datasets in the low-data regime and even outperforms semi-supervised learning methods on CIFAR-10. We demonstrate how and when the negative evidence lower bound (nELBO) can be used as an approximate compressed length for classification. By revealing the correlation between compression rate and classification accuracy, we illustrate that under NPC-LV, improvements in generative models can enhance downstream classification accuracy.
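To make the classification step concrete, the following Python snippet is a minimal sketch, not the paper's implementation. It treats the nELBO reported by a trained latent-variable generative model as the approximate compressed length C(x), plugs it into the Normalized Compression Distance NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), and labels a test point by a k-nearest-neighbor vote over the few labeled examples. The helpers `nelbo` and `concat` are hypothetical stand-ins for the model's bound (in bits) and for whatever pair-encoding strategy the framework uses.

    # A minimal sketch, NOT the paper's implementation: k-NN classification
    # under Normalized Compression Distance (NCD), using a trained latent-
    # variable model's nELBO as the approximate compressed length C(x).
    # `nelbo` and `concat` are hypothetical placeholders for the model's
    # bound (in bits) and for the framework's pair-encoding strategy.

    import numpy as np

    def ncd(c_x, c_y, c_xy):
        """NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
        return (c_xy - min(c_x, c_y)) / max(c_x, c_y)

    def classify(test_x, labeled_xs, labels, nelbo, concat, k=1):
        """Label a test point by a k-nearest-neighbor vote over the few
        labeled examples; no parameters are trained in this step."""
        c_test = nelbo(test_x)
        dists = []
        for x in labeled_xs:
            c_xy = nelbo(concat(test_x, x))  # approx. compressed length of the pair
            dists.append(ncd(c_test, nelbo(x), c_xy))
        nearest = np.argsort(dists)[:k]
        votes = [labels[i] for i in nearest]
        return max(set(votes), key=votes.count)  # majority vote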