Statistical analysis of massive datasets often requires expensive linear algebra operations on large dense matrices. Typical tasks are the estimation of unknown parameters of the underlying statistical model and the prediction of missing values. We developed the H-MLE procedure, which solves both tasks. The unknown parameters are estimated by maximizing the joint Gaussian log-likelihood function, which depends on a covariance matrix. To reduce the high computational cost, we approximate the covariance matrix in the hierarchical (H-) matrix format. The H-matrix technique allows us to work with inhomogeneous covariance matrices and almost arbitrary locations. In particular, H-matrices can be applied when the matrices under consideration are dense and unstructured. For validation purposes, we implemented three machine learning methods: k-nearest neighbors (kNN), random forest, and a deep neural network. Of these three, the kNN method gave the best results for the given datasets, with three or seven neighbors depending on the dataset. The results computed with the H-MLE method were compared with those obtained by the kNN method. The developed H-matrix code and all datasets are freely available online.
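To make the quantity being maximized concrete, the following is a minimal sketch of the joint Gaussian log-likelihood for zero-mean data, evaluated with a dense Cholesky factorization. It is not the H-MLE code: the dense factorization costs O(n^3) and stands in for the H-matrix approximation described above. The exponential covariance model and the names `exp_covariance`, `gaussian_loglik`, `variance`, and `length` are assumptions introduced only for illustration.

```python
import numpy as np


def exp_covariance(locs, variance, length):
    """Assumed model: C_ij = variance * exp(-|x_i - x_j| / length)."""
    # Pairwise Euclidean distances between all locations (n x n).
    dist = np.linalg.norm(locs[:, None, :] - locs[None, :, :], axis=-1)
    return variance * np.exp(-dist / length)


def gaussian_loglik(z, cov):
    """Joint Gaussian log-likelihood of zero-mean observations z.

    log L = -n/2 * log(2*pi) - 1/2 * log det(C) - 1/2 * z^T C^{-1} z
    """
    n = z.shape[0]
    # Dense Cholesky factor C = L L^T; this is the expensive step that
    # a hierarchical approximation is meant to make cheaper.
    chol = np.linalg.cholesky(cov)
    # log det(C) = 2 * sum(log(diag(L)))
    logdet = 2.0 * np.sum(np.log(np.diag(chol)))
    # z^T C^{-1} z = ||L^{-1} z||^2
    y = np.linalg.solve(chol, z)
    return -0.5 * (n * np.log(2.0 * np.pi) + logdet + y @ y)
```

Parameter estimation then amounts to searching for the `variance` and `length` values that maximize `gaussian_loglik(z, exp_covariance(locs, variance, length))`, for example with a grid search or a general-purpose optimizer.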