The ability to estimate joint, conditional and marginal probability distributions over a set of variables is useful for many common machine learning tasks. However, estimating these distributions can be challenging, particularly when the data contain a mix of discrete and continuous variables. This paper presents a non-parametric method for estimating these distributions directly from a dataset. The data are first represented as a graph consisting of object nodes and attribute value nodes. Depending on the distribution to be estimated, an appropriate eigenvector equation is constructed and solved to find the corresponding stationary distribution of the graph, from which the required distributions can be estimated and sampled. The paper demonstrates how the method can be applied to many common machine learning tasks, including classification, regression, missing value imputation, outlier detection, random vector generation, and clustering.
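To make the graph idea concrete, the following is a minimal sketch and not the paper's actual formulation. It assumes discrete attributes only, an undirected graph linking each object node to its attribute value nodes, and a random walk with restart whose stationary distribution is found by power iteration. The toy dataset, the function names (`build_graph`, `stationary`, `conditional`) and the restart weight `alpha` are all hypothetical choices made for illustration; the abstract does not specify the eigenvector equation, and handling continuous variables would need additional machinery not shown here.

```python
from collections import defaultdict

# Toy dataset (hypothetical): each row is one object with discrete attributes.
ROWS = [
    {"weather": "sunny", "play": "yes"},
    {"weather": "sunny", "play": "yes"},
    {"weather": "sunny", "play": "no"},
    {"weather": "rainy", "play": "no"},
    {"weather": "rainy", "play": "no"},
]


def build_graph(rows):
    """Link each object node to one node per attribute value it takes."""
    adj = defaultdict(set)
    for i, row in enumerate(rows):
        obj = ("obj", i)
        for attr, value in row.items():
            val = ("val", attr, value)
            adj[obj].add(val)
            adj[val].add(obj)
    return adj


def stationary(adj, restart, alpha=0.5, iters=200):
    """Power iteration for a random walk with restart.

    Assumed update (not taken from the paper): at each step the walker
    jumps back to the restart nodes with probability alpha, otherwise
    it moves to a uniformly chosen neighbour.
    """
    nodes = list(adj)
    total = sum(restart.values())
    r = {n: restart.get(n, 0.0) / total for n in nodes}
    p = dict(r)
    for _ in range(iters):
        nxt = {n: alpha * r[n] for n in nodes}
        for n in nodes:
            share = (1 - alpha) * p[n] / len(adj[n])
            for m in adj[n]:
                nxt[m] += share
        p = nxt
    return p


def conditional(rows, target, evidence, alpha=0.5):
    """Estimate a distribution over `target` values given observed `evidence`."""
    adj = build_graph(rows)
    # Restart the walk at the attribute value nodes that were observed.
    restart = {("val", a, v): 1.0 for a, v in evidence.items()}
    p = stationary(adj, restart, alpha)
    # Read off the mass on the target attribute's value nodes and normalise.
    scores = {n[2]: s for n, s in p.items() if n[0] == "val" and n[1] == target}
    z = sum(scores.values())
    return {v: s / z for v, s in scores.items()}


if __name__ == "__main__":
    print(conditional(ROWS, "play", {"weather": "sunny"}))
```

In this sketch, conditioning on `weather = sunny` places more mass on `play = yes` than on `play = no`, which matches the direction of the empirical frequencies in the toy data. The estimate depends on `alpha`, so it should be read as an illustration of the stationary-distribution idea rather than as a calibrated probability.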