Real-world datasets characterized by discrete features are ubiquitous: from categorical surveys to clinical questionnaires, from unweighted networks to DNA sequences. Nevertheless, the most common unsupervised dimensionality reduction methods are designed for continuous spaces, and applying them to discrete spaces can lead to errors and biases. In this letter we introduce an algorithm to infer the intrinsic dimension (ID) of datasets embedded in discrete spaces. We demonstrate its accuracy on benchmark datasets, and we apply it to a metagenomic dataset for species fingerprinting, finding a surprisingly small ID, of order 2. This suggests that evolutionary pressure acts on a low-dimensional manifold despite the high dimensionality of sequence space.