The Gaussian kernel and its traditional normalizations (e.g., row-stochastic) are popular approaches for assessing similarities between data points, commonly used for manifold learning and clustering, as well as supervised and semi-supervised learning on graphs. In many practical situations, the data can be corrupted by noise that prevents traditional affinity matrices from correctly assessing similarities, especially if the noise magnitudes vary considerably across the data, e.g., under heteroskedasticity or in the presence of outliers. An alternative approach that behaves more stably under noise is the doubly stochastic normalization of the Gaussian kernel. In this work, we investigate this normalization in a setting where points are sampled from an unknown density on a low-dimensional manifold embedded in a high-dimensional space and corrupted by possibly strong, non-identically distributed, sub-Gaussian noise. We establish the pointwise concentration of the doubly stochastic affinity matrix and its scaling factors around certain population forms. We then utilize these results to develop several tools for robust inference. First, we derive a robust density estimator that can substantially outperform the standard kernel density estimator under high-dimensional noise. Second, we provide estimators for the pointwise noise magnitudes, the pointwise signal magnitudes, and the pairwise Euclidean distances between clean data points. Lastly, we derive robust graph Laplacian normalizations that approximate popular manifold Laplacians, including the Laplace-Beltrami operator, showing that the local geometry of the manifold can be recovered under high-dimensional noise. We exemplify our results in simulations and on real single-cell RNA-sequencing data. In the latter, we show that our proposed normalizations are robust to technical variability associated with different cell types.
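To make the central object concrete, the following is a minimal sketch of a doubly stochastic normalization of a Gaussian kernel: it looks for positive scaling factors d such that W = diag(d) K diag(d) has unit row and column sums. Several choices here are assumptions made for illustration and are not taken from the abstract: the zero main diagonal of K, the damped symmetric fixed-point iteration used as the solver, the bandwidth `eps`, the synthetic noisy-circle data, and all function and variable names.

```python
import numpy as np


def gaussian_kernel(X, eps, zero_diagonal=True):
    """Gaussian kernel K_ij = exp(-||x_i - x_j||^2 / eps) for the rows of X.

    Zeroing the main diagonal is an assumption of this sketch.
    """
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    np.maximum(sq_dists, 0.0, out=sq_dists)  # clip round-off negatives
    K = np.exp(-sq_dists / eps)
    if zero_diagonal:
        np.fill_diagonal(K, 0.0)
    return K


def doubly_stochastic_scaling(K, tol=1e-10, max_iter=10000):
    """Find d > 0 with d_i * (K d)_i = 1 for all i, for symmetric K.

    Uses the damped update d <- sqrt(d / (K d)), whose fixed points
    satisfy the scaling equation. This solver is one possible choice,
    not necessarily the one used in the work summarized above.
    """
    d = np.ones(K.shape[0])
    for _ in range(max_iter):
        d = np.sqrt(d / (K @ d))
        if np.max(np.abs(d * (K @ d) - 1.0)) < tol:
            break
    W = d[:, None] * K * d[None, :]
    return W, d


# Hypothetical data: a unit circle embedded in a high-dimensional space,
# corrupted by Gaussian noise whose level differs from point to point.
rng = np.random.default_rng(0)
n, ambient_dim = 200, 500
theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
clean = np.zeros((n, ambient_dim))
clean[:, 0] = np.cos(theta)
clean[:, 1] = np.sin(theta)
noise_std = rng.uniform(0.01, 0.05, size=n)  # heteroskedastic noise levels
noisy = clean + noise_std[:, None] * rng.standard_normal((n, ambient_dim))

K = gaussian_kernel(noisy, eps=0.5)
W, d = doubly_stochastic_scaling(K)
print("max row-sum deviation from 1:", np.max(np.abs(W.sum(axis=1) - 1.0)))
```

Because K is symmetric and the same factor vector scales rows and columns, W is symmetric, so unit row sums imply unit column sums. The returned vector `d` plays the role of the scaling factors mentioned in the abstract; the estimators built from them are not reproduced here.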