Clustering, classification, and representation are three fundamental objectives of learning from high-dimensional data with intrinsic structure. To these ends, this paper introduces three interpretable approaches: segmentation (clustering) via the Minimum Lossy Coding Length criterion, classification via the Minimum Incremental Coding Length criterion, and representation via the Maximal Coding Rate Reduction criterion. All three are derived within the lossy data coding and compression framework, based on the rate-distortion principle of information theory. The resulting algorithms are particularly suited to finite-sample data, possibly sparse or nearly degenerate, drawn from mixtures of Gaussian distributions or subspaces. The theoretical value and attractive features of these methods are summarized through comparison with other learning methods and evaluation criteria. This summary note aims to provide a theoretical guide for researchers (and engineers) interested in understanding 'white-box' machine learning and deep learning methods.
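For concreteness, the sketch below illustrates the rate-distortion coding rate R(Z, ε) = ½ log det(I + d/(nε²) ZZᵀ) that underlies all three criteria, together with the rate-reduction quantity ΔR optimized by the Maximal Coding Rate Reduction objective. This is a minimal NumPy illustration of the standard formulas; the function names, the default ε, and the toy data are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """R(Z, eps) = 1/2 * logdet(I + d/(n*eps^2) * Z @ Z.T)
    for a d x n matrix Z whose n columns are samples."""
    d, n = Z.shape
    _, logdet = np.linalg.slogdet(np.eye(d) + (d / (n * eps ** 2)) * Z @ Z.T)
    return 0.5 * logdet

def rate_reduction(Z, labels, eps=0.5):
    """Delta R = R(Z) - sum_j (n_j / n) * R(Z_j): the coding rate of the
    whole sample set minus the average rate of its labeled parts."""
    labels = np.asarray(labels)
    d, n = Z.shape
    compressed = 0.0
    for c in np.unique(labels):
        Zc = Z[:, labels == c]            # samples assigned to class c
        nc = Zc.shape[1]
        _, logdet = np.linalg.slogdet(
            np.eye(d) + (d / (nc * eps ** 2)) * Zc @ Zc.T)
        compressed += (nc / n) * 0.5 * logdet
    return coding_rate(Z, eps) - compressed

# Toy example (hypothetical data): two 1-D subspaces in R^10, nearly
# orthogonal with high probability, give a large rate reduction.
rng = np.random.default_rng(0)
Z = np.hstack([np.outer(rng.standard_normal(10), rng.standard_normal(50))
               for _ in range(2)])
Z /= np.linalg.norm(Z, axis=0, keepdims=True)   # unit-norm features
labels = np.repeat([0, 1], 50)
print(rate_reduction(Z, labels))
```

Intuitively, ΔR is large when the whole dataset is expansive (high total coding rate) while each labeled part is compact (low per-class rate), which is why maximizing it yields discriminative yet diverse representations.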