This paper uses techniques from Random Matrix Theory to find the ideal training/test data split for a simple linear regression with m data points, each an independent n-dimensional multivariate Gaussian. It defines "ideal" as satisfying the integrity metric: the empirical model error equals the actual measurement noise, and thus fairly reflects the value, or lack thereof, of the model. This paper is the first to solve for the training and test set sizes for any model in a way that is truly optimal. The number of data points in the training set is the root of a quartic polynomial, derived in Theorem 1, that depends only on m and n; the covariance matrix of the multivariate Gaussian, the true model parameters, and the true measurement noise all drop out of the calculations. The critical mathematical difficulties were recognizing that the problems herein had been discussed in the context of the Jacobi Ensemble, a probability distribution describing the eigenvalues of a known random matrix model, and evaluating a new integral in the style of Selberg and Aomoto. The mathematical results are supported by thorough computational evidence. This paper is a step towards automatic choices of training/test set sizes in machine learning.
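Since the abstract describes a concrete simulation setup, the following is a minimal sketch of that setup: an ordinary-least-squares fit on the training portion, with the mean test-set MSE compared against the true noise level. This is an illustrative assumption, not the paper's method; the exact integrity criterion and the quartic of Theorem 1 are not reproduced here, and the function name and brute-force scan over split sizes are hypothetical.

```python
import numpy as np

# Illustrative sketch (not the paper's derivation): Monte Carlo estimate of the
# expected empirical test error at each candidate training-set size, to be
# compared against the true measurement noise sigma^2. The paper instead
# characterizes the optimal split as the root of a quartic in m and n.

def empirical_test_error(m_train, m, n, sigma=1.0, trials=200, seed=None):
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(trials):
        beta = rng.standard_normal(n)                # true model parameters
        X = rng.standard_normal((m, n))              # m i.i.d. n-dim Gaussian points
        y = X @ beta + sigma * rng.standard_normal(m)
        Xtr, ytr = X[:m_train], y[:m_train]          # training split
        Xte, yte = X[m_train:], y[m_train:]          # test split
        beta_hat, *_ = np.linalg.lstsq(Xtr, ytr, rcond=None)  # OLS fit on train
        errs.append(np.mean((yte - Xte @ beta_hat) ** 2))     # empirical test MSE
    return np.mean(errs)

m, n, sigma = 100, 5, 1.0
for m_train in range(n + 2, m - 1, 10):
    err = empirical_test_error(m_train, m, n, sigma)
    print(f"m_train={m_train:3d}  mean test MSE={err:.3f}  (target sigma^2={sigma**2})")
```

As expected from the abstract, the Gaussian covariance, the true parameters beta, and sigma itself can be varied in this sketch without changing which split sizes bring the empirical error close to the noise floor; only m and n matter.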