Autoencoders are commonly used in representation learning. They consist of an encoder and a decoder, which provide a straightforward way to map n-dimensional data in input space to a lower m-dimensional representation space and back. The decoder itself defines an m-dimensional manifold in input space. Inspired by manifold learning, we show that the decoder can be trained on its own by learning the representations of the training samples along with the decoder weights using gradient descent. A sum-of-squares loss then corresponds to optimizing the manifold to have the smallest Euclidean distance to the training samples, and similarly for other loss functions. We derive expressions for the number of samples needed to specify the encoder and decoder and show that the decoder generally requires far fewer training samples to be well specified than the encoder. We discuss the training of autoencoders from this perspective and relate it to previous work in the field that uses noisy training examples and other types of regularization. On the natural image data sets MNIST and CIFAR10, we demonstrate that the decoder is much better suited to learning a low-dimensional representation, especially when trained on small data sets. Using simulated gene regulatory data, we further show that training the decoder alone leads to better generalization and meaningful representations. Our approach of training the decoder alone facilitates representation learning even on small data sets and can lead to improved training of autoencoders. We hope that the simple analyses presented here will also contribute to an improved conceptual understanding of representation learning.
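The decoder-only training described above can be sketched in a few lines. The following is a minimal illustration, not the paper's actual models or data: it uses a toy linear decoder and synthetic data of our own choosing, and treats each training sample's m-dimensional representation as a free parameter optimized jointly with the decoder weights by gradient descent on the sum-of-squares reconstruction loss, with no encoder involved.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (illustrative, not from the paper): n points lying near
# an m-dimensional linear manifold embedded in d-dimensional input space.
n, d, m = 200, 20, 3
X = rng.normal(size=(n, m)) @ rng.normal(size=(m, d)) + 0.01 * rng.normal(size=(n, d))

# Decoder-only training: the per-sample representations Z are free
# parameters, learned jointly with the decoder weights W by gradient
# descent on the sum-of-squares loss ||z_i W - x_i||^2 (no encoder).
Z = rng.normal(size=(n, m))               # one m-dim code per training sample
W = rng.normal(size=(m, d)) / np.sqrt(d)  # linear decoder: x_hat = z @ W

lr = 0.05
for _ in range(3000):
    R = Z @ W - X                 # reconstruction residuals
    Z -= lr * 2 * R @ W.T         # gradient of ||z_i W - x_i||^2 w.r.t. z_i
    W -= lr * (2 / n) * Z.T @ R   # average gradient w.r.t. decoder weights

loss = np.mean((Z @ W - X) ** 2)
print(f"final mean squared reconstruction error: {loss:.4f}")
```

A nonlinear decoder works the same way: the codes `Z` simply become extra trainable parameters alongside the network weights, and at test time a representation for a new sample is found by minimizing the same reconstruction loss over its code.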