High-dimensional data sets are often analyzed and explored via the construction of a latent low-dimensional space, which enables convenient visualization and efficient predictive modeling or clustering. For complex data structures, linear dimensionality reduction techniques like PCA may not be sufficiently flexible to yield a good low-dimensional representation. Non-linear dimension reduction techniques, like kernel PCA and autoencoders, suffer from a loss of interpretability, since each latent variable depends on all input dimensions. To address this limitation, we here present path lasso penalized autoencoders. This structured regularization enhances interpretability by penalizing each path through the encoder from an input to a latent variable, thus restricting how many input variables are represented in each latent dimension. Our algorithm uses a group lasso penalty and non-negative matrix factorization to construct a sparse, non-linear latent representation. We compare the path lasso regularized autoencoder to PCA, sparse PCA, autoencoders, and sparse autoencoders on real and simulated data sets. We show that the algorithm achieves much lower reconstruction error than sparse PCA and parameter-wise lasso regularized autoencoders for low-dimensional representations. Moreover, path lasso representations provide a more accurate reconstruction match, i.e., better preservation of the relative distances between objects in the original and reconstructed spaces.
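The path-wise grouping described above can be illustrated with a minimal NumPy sketch for a two-layer encoder. This is an assumption-laden illustration, not the paper's exact formulation: it groups, for each (input, latent) pair, the products of the weights along every path connecting them, and takes an L2 norm over each group so that whole input-to-latent paths are shrunk to zero jointly.

```python
import numpy as np

def path_group_penalty(W1, W2):
    """Illustrative path-wise group lasso penalty (a sketch, not the
    paper's exact algorithm).

    W1: (d, h) weights, input -> hidden layer.
    W2: (h, k) weights, hidden -> latent layer.
    Each group (i, j) collects the weight products along all h paths
    from input i to latent variable j; the L2 norm over each group
    encourages entire input->latent paths to vanish together, so that
    each latent dimension depends on only a few inputs.
    """
    d, h = W1.shape
    h2, k = W2.shape
    assert h == h2, "hidden dimensions must match"
    # path_prod[i, m, j] = W1[i, m] * W2[m, j]: weight product along
    # the path input i -> hidden unit m -> latent j
    path_prod = W1[:, :, None] * W2[None, :, :]
    # group lasso: sum over (i, j) pairs of the L2 norm across paths m
    return np.sqrt((path_prod ** 2).sum(axis=1)).sum()
```

In training, a penalty of this form would be added to the reconstruction loss with a tuning weight; for deeper encoders the product would extend over all layers on each path.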