Machine learning often requires estimating a density from a multidimensional data sample, ideally in a way that also models correlations between coordinates. In addition, data are often incomplete: some data points carry only partial information, with the values of some coordinates missing. This paper adapts a rapid parametric density estimation technique to this setting. The density is modelled as a linear combination of functions, for which $L^2$ optimization says that the estimated coefficient of a given function is simply the average of that function over the sample. Hierarchical correlation reconstruction first models the probability density of each coordinate separately, using all its appearances in the data sample. It then adds corrections from independently modelled pairwise correlations, using all samples that contain both coordinates, and so on, independently adding correlations for growing numbers of variables using decreasing evidence in the data sample. A basic application of the resulting multidimensional density is imputation of missing coordinates: insert the known coordinates into the density, take the expected values of the missing coordinates, and optionally use their variance to estimate the uncertainty.
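The estimation and imputation steps described above can be illustrated with a minimal Python sketch. Several details here are assumptions not stated in the abstract: the data are taken to be already normalized to $[0,1]$ in each coordinate, the basis is the family of rescaled Legendre polynomials up to degree 2 (orthonormal on $[0,1]$, which is what makes each coefficient a plain sample average), missing values are encoded as `NaN`, and coefficients with no supporting rows are set to zero. The names `basis`, `fit_coefficients`, and `impute` are hypothetical. The sketch computes only the conditional expected value, not the variance, and it does not handle the case where the polynomial density becomes non-positive.

```python
import itertools
import numpy as np

SQRT3, SQRT5 = np.sqrt(3.0), np.sqrt(5.0)

def basis(j, x):
    """Rescaled Legendre polynomial of degree j, orthonormal on [0, 1] (assumed basis)."""
    x = np.asarray(x, dtype=float)
    if j == 0:
        return np.ones_like(x)
    if j == 1:
        return SQRT3 * (2.0 * x - 1.0)
    if j == 2:
        return SQRT5 * (6.0 * x * x - 6.0 * x + 1.0)
    raise ValueError("this sketch implements degrees 0..2 only")

def fit_coefficients(data, max_degree=2):
    """Estimate each coefficient a_j as the sample mean of prod_i f_{j_i}(x_i).

    data: (n, d) array with values in [0, 1]; np.nan marks a missing coordinate.
    A coefficient uses only the rows in which every coordinate with j_i > 0 is
    known, so lower-order terms draw on more evidence than higher-order ones.
    """
    n, d = data.shape
    known = ~np.isnan(data)
    coeffs = {}
    for j in itertools.product(range(max_degree + 1), repeat=d):
        active = [i for i in range(d) if j[i] > 0]
        if not active:
            coeffs[j] = 1.0  # constant term: the density integrates to 1
            continue
        rows = known[:, active].all(axis=1)
        if not rows.any():
            coeffs[j] = 0.0  # assumption: no evidence, so drop this term
            continue
        prod = np.ones(int(rows.sum()))
        for i in active:
            prod *= basis(j[i], data[rows, i])
        coeffs[j] = float(prod.mean())
    return coeffs

# Integrals of x * f_j(x) over [0, 1] for j = 0, 1, 2.
FIRST_MOMENT = {0: 0.5, 1: 1.0 / (2.0 * SQRT3), 2: 0.0}

def impute(coeffs, point, target):
    """Expected value of coordinate `target` given the known entries of `point`.

    Other missing coordinates are integrated out; by orthonormality only
    terms with j_i = 0 survive for them.
    """
    point = np.asarray(point, dtype=float)
    num = den = 0.0
    for j, a in coeffs.items():
        w = a
        for i, x in enumerate(point):
            if i == target:
                continue
            if np.isnan(x):
                if j[i] != 0:
                    w = 0.0
                    break
            else:
                w *= float(basis(j[i], x))
        num += w * FIRST_MOMENT[j[target]]
        if j[target] == 0:
            den += w  # integral of f_j over [0, 1] is 1 for j = 0, else 0
    return num / den

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x1 = rng.random(2000)
    x2 = 0.5 * x1 + 0.5 * rng.random(2000)      # correlated second coordinate
    data = np.column_stack([x1, x2])
    data[rng.random(2000) < 0.3, 1] = np.nan    # hide about 30% of x2
    coeffs = fit_coefficients(data)
    print(impute(coeffs, [0.9, np.nan], target=1))
```

In this sketch the single-coordinate coefficients use every row where that coordinate is present, while the pairwise coefficients use only rows where both coordinates are present, which mirrors the "decreasing evidence" hierarchy described above.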