The success of modern machine learning algorithms depends crucially on efficient data representation and compression through dimensionality reduction. This practice seemingly contradicts the conventional intuition that data processing always leads to information loss. We prove that this intuition is wrong. For any non-convex problem, there exists an optimal, benign auto-encoder (BAE) that extracts a lower-dimensional data representation that is strictly beneficial: compressing model inputs improves model performance. We prove that the BAE projects data onto a manifold whose dimension is the compressibility dimension of the learning model. We develop and implement an efficient algorithm for computing the BAE and show that it improves model performance on every dataset we consider. Furthermore, by compressing "malignant" data dimensions, the BAE makes learning more stable and robust.