Estimating conditional mutual information (CMI) is an essential yet challenging step in many machine learning and data mining tasks. Estimating CMI from data that contains both discrete and continuous variables, or even discrete-continuous mixture variables, is a particularly hard problem. In this paper, we show that CMI for such mixture variables, defined via the Radon-Nikodym derivative, can be written as a sum of entropies, just like CMI for purely discrete or purely continuous data. Further, we show that CMI can be consistently estimated for discrete-continuous mixture variables by learning an adaptive histogram model. In practice, we estimate such a model by iteratively discretizing the continuous data points in the mixture variables. To evaluate the performance of our estimator, we benchmark it against state-of-the-art CMI estimators and evaluate it in a causal discovery setting.
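For orientation, the entropy decomposition referred to above has, in the purely discrete or purely continuous case, the standard form I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z). The sketch below plugs histogram-based entropy estimates into that identity. It is only an illustration under stated assumptions: it uses fixed equal-width bins rather than the adaptive histogram learned in the paper, and the function names (`plugin_entropy`, `discretize`, `cmi_histogram`) and the `bins` parameter are hypothetical, not taken from the authors' implementation.

```python
from collections import Counter

import numpy as np


def plugin_entropy(columns):
    """Plug-in (maximum-likelihood) joint entropy, in nats, of discrete columns."""
    rows = list(zip(*columns))
    n = len(rows)
    counts = Counter(rows)
    return -sum((c / n) * np.log(c / n) for c in counts.values())


def discretize(values, bins):
    """Equal-width binning (assumption: a fixed stand-in for an adaptive histogram)."""
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), bins + 1)
    # Use only the interior edges so every point falls into one of `bins` cells.
    return np.digitize(values, edges[1:-1])


def cmi_histogram(x, y, z, bins=8):
    """Estimate I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z) from binned data."""
    xd, yd, zd = (discretize(v, bins).tolist() for v in (x, y, z))
    return (
        plugin_entropy([xd, zd])
        + plugin_entropy([yd, zd])
        - plugin_entropy([xd, yd, zd])
        - plugin_entropy([zd])
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z = rng.normal(size=2000)
    # A zero-inflated variable: a point mass at 0 mixed with a continuous part.
    x = np.where(rng.random(2000) < 0.3, 0.0, z + rng.normal(size=2000))
    y = x + rng.normal(size=2000)
    print("estimated I(X;Y|Z) in nats:", cmi_histogram(x, y, z))
```

With a fixed grid the estimate depends strongly on `bins` and is biased upward when cells are sparsely populated, which is one motivation for choosing the discretization adaptively, as the abstract describes.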