This paper proposes a new algorithm for automatic variable selection in High Dimensional Graphical Models. The algorithm selects the variables relevant to a node of interest on the basis of mutual information. Several contributions in the literature have investigated the use of mutual information for selecting an appropriate number of relevant features in a large data-set, but most of them focus on binary outcomes or require a high computational effort. The algorithm proposed here overcomes these drawbacks, as it is an extension of Chow and Liu's algorithm. Once the probabilistic structure of a High Dimensional Graphical Model has been determined via this algorithm, the best path-step, i.e., the set of variables with the most explanatory/predictive power for a variable of interest, is identified by computing the entropy coefficient of determination. The latter, being based on the notion of (symmetric) Kullback-Leibler divergence, is closely connected to the mutual information of the variables involved. Applying the algorithm to a wide range of real-world, publicly available data-sets highlights its potential and its greater effectiveness compared with existing alternative methods.
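The stated connection between the entropy coefficient of determination and mutual information rests on two standard identities, recorded below for reference. The paper's exact normalization of the symmetric divergence into a coefficient of determination is not reproduced here, and the symbol $J(X;Y)$ is our notation, not the authors'.

```latex
% Mutual information as a Kullback-Leibler divergence
I(X;Y) \;=\; D_{\mathrm{KL}}\!\left(p_{XY}\,\|\,p_X p_Y\right)
       \;=\; \sum_{x,y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}

% Symmetrized form underlying the entropy coefficient of determination
J(X;Y) \;=\; D_{\mathrm{KL}}\!\left(p_{XY}\,\|\,p_X p_Y\right)
       \;+\; D_{\mathrm{KL}}\!\left(p_X p_Y\,\|\,p_{XY}\right)
```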
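Since the proposed algorithm is described as an extension of Chow and Liu's procedure, a minimal sketch of the classic Chow-Liu step may help fix ideas: pairwise mutual information is estimated from discrete data and a maximum-weight spanning tree is extracted. The function name `chow_liu_tree` is illustrative; the paper's extension (path-step selection via the entropy coefficient of determination) is not reproduced here.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from sklearn.metrics import mutual_info_score

def chow_liu_tree(X):
    """Return the edges of a maximum-MI spanning tree over the columns of X.

    X: (n_samples, n_vars) array of discrete observations.
    """
    n_vars = X.shape[1]
    mi = np.zeros((n_vars, n_vars))
    # Estimate pairwise mutual information between every pair of variables.
    for i in range(n_vars):
        for j in range(i + 1, n_vars):
            mi[i, j] = mutual_info_score(X[:, i], X[:, j])
    # A maximum spanning tree on MI equals a minimum spanning tree
    # on the negated weights (zero-MI pairs are treated as absent edges).
    mst = minimum_spanning_tree(-mi)
    rows, cols = mst.nonzero()
    return list(zip(rows, cols))
```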