The central question in representation learning is what constitutes a good or meaningful representation. In this work we argue that if the data have an inherent cluster structure, where clusters can be characterized by different means and covariances, this structure should be reflected in the embedding as well. While autoencoders (AE) are widely used in practice for unsupervised representation learning, they do not fulfil the above condition, as they learn a single representation of the data. To overcome this, we propose a meta-algorithm that extends an arbitrary AE architecture to a tensorized version (TAE) that learns cluster-specific embeddings while simultaneously learning the cluster assignment. For the linear setting we prove that a TAE recovers the principal components of the individual clusters, in contrast to a standard AE, which recovers the principal components of the entire dataset. We validate this on planted models, and for general non-linear and convolutional AEs we empirically show that tensorizing the AE is beneficial in clustering and denoising tasks.