Ensemble Estimation of Generalized Mutual Information with Applications to Genomics

From arXiv. Accepted to IEEE Transactions on Information Theory; 42 pages, 3 figures. A shorter version of this paper was published at IEEE ISIT 2017 under the title "Ensemble estimation of mutual information".

Mutual information is a measure of the dependence between random variables that has been used successfully in myriad applications in many fields. Generalized mutual information measures that go beyond classical Shannon mutual information have also received much interest in these applications. We derive the mean squared error convergence rates of kernel density-based plug-in estimators of general mutual information measures between two multidimensional random variables $\mathbf{X}$ and $\mathbf{Y}$ for two cases: 1) $\mathbf{X}$ and $\mathbf{Y}$ are continuous; 2) $\mathbf{X}$ and $\mathbf{Y}$ may have any mixture of discrete and continuous components. Using the derived rates, we propose an ensemble estimator of these information measures called GENIE by taking a weighted sum of the plug-in estimators with varied bandwidths. The resulting ensemble estimators achieve the $1/N$ parametric mean squared error convergence rate when the conditional densities of the continuous variables are sufficiently smooth. To the best of our knowledge, this is the first nonparametric mutual information estimator known to achieve the parametric convergence rate for the mixture case, which frequently arises in applications (e.g. variable selection in classification). The estimator is simple to implement and it uses the solution to an offline convex optimization problem and simple plug-in estimators. A central limit theorem is also derived for the ensemble estimators and minimax rates are derived for the continuous case. We demonstrate the ensemble estimator for the mixed case on simulated data and apply the proposed estimator to analyze gene relationships in single cell data.
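
The ensemble idea in the abstract (a weighted sum of kernel plug-in estimators computed at several bandwidths) can be illustrated with a minimal sketch for the continuous case and Shannon mutual information. Everything specific here is an assumption made for illustration and is not taken from the paper: the Gaussian product kernel, the resubstitution (non leave-one-out) plug-in form, the bandwidth schedule `h = l * N ** (-1 / (2 * d))`, the choice of bias powers to cancel, and the closed-form minimum-norm weights, which stand in for the offline convex optimization problem the abstract mentions. The function names are hypothetical.

```python
import numpy as np


def kde_at_samples(Z, h):
    """Gaussian product-kernel density estimate evaluated at each row of Z (N x d)."""
    N, d = Z.shape
    # Pairwise squared Euclidean distances between all sample points.
    sq_dist = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=2)
    kernel = np.exp(-0.5 * sq_dist / h**2) / ((2 * np.pi) ** (d / 2) * h**d)
    return kernel.mean(axis=1)


def plugin_mi(X, Y, h):
    """Resubstitution plug-in estimate of Shannon MI at a single bandwidth h."""
    f_x = kde_at_samples(X, h)
    f_y = kde_at_samples(Y, h)
    f_xy = kde_at_samples(np.hstack([X, Y]), h)
    return float(np.mean(np.log(f_xy / (f_x * f_y))))


def min_norm_weights(ls, powers):
    """Smallest-L2-norm weights with sum(w) = 1 and sum(w * ls**p) = 0 for p in powers.

    Assumption: equality constraints solved in closed form via the pseudoinverse,
    a simplification of an optimization over the ensemble weights.
    """
    A = np.vstack([np.ones_like(ls)] + [ls**p for p in powers])
    b = np.zeros(A.shape[0])
    b[0] = 1.0
    return np.linalg.pinv(A) @ b


def ensemble_mi(X, Y, ls, powers):
    """Weighted sum of plug-in estimates over bandwidth multipliers ls."""
    N = X.shape[0]
    d = X.shape[1] + Y.shape[1]
    w = min_norm_weights(ls, powers)
    # Assumed bandwidth schedule: each multiplier l scales N**(-1/(2d)).
    estimates = np.array([plugin_mi(X, Y, l * N ** (-1.0 / (2 * d))) for l in ls])
    return float(w @ estimates), w


# Usage on synthetic dependent data (hypothetical example, not the paper's experiments).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
Y = X + 0.5 * rng.normal(size=(200, 1))
ls = np.linspace(1.0, 3.0, 10)
estimate, weights = ensemble_mi(X, Y, ls, powers=(1, 2))
```

The weights depend only on the bandwidth multipliers and the chosen powers, not on the data, so they can be computed once offline and reused across datasets of the same dimension.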
