Estimating the intrinsic dimension of a dataset is a fundamental step in most dimensionality reduction techniques. This article illustrates intRinsic, an R package that implements novel state-of-the-art likelihood-based estimators of the intrinsic dimension of a dataset. To make these estimators easily accessible, the package contains a small number of high-level functions that rely on a broader set of efficient, low-level routines. The models in intRinsic fall into two categories: homogeneous and heterogeneous intrinsic dimension estimators. The first category contains the TWO-NN model, an estimator derived from the distributional properties of the ratios of the distances between each data point and its first two nearest neighbors. The functions dedicated to this method carry out inference under both the frequentist and Bayesian frameworks. The second category contains Hidalgo, a Bayesian mixture model for which an efficient Gibbs sampler is implemented. After presenting the theoretical background, we demonstrate the performance of the models on simulated datasets, which simplifies the exposition because the validity of the results can be assessed immediately. Then, we employ the package to study the intrinsic dimension of the Alon dataset, obtained from a famous microarray experiment. We show how the estimation of homogeneous and heterogeneous intrinsic dimensions allows us to gain valuable insights into the topological structure of a dataset.
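As a rough illustration of the idea behind the TWO-NN model mentioned above, the following sketch computes the ratio of second to first nearest-neighbor distances for every point and plugs it into the maximum-likelihood estimator implied by a Pareto law for those ratios. It is written in Python with NumPy rather than R, it is not the intRinsic implementation, and the function name `twonn_mle` is hypothetical. It assumes Euclidean distances, no duplicated points, and a dataset small enough for a dense pairwise distance computation.

```python
import numpy as np


def twonn_mle(X):
    """Sketch of a TWO-NN maximum-likelihood estimate of the intrinsic dimension.

    Hypothetical helper, not part of intRinsic. Assumes Euclidean distances
    and no duplicated rows (a zero nearest-neighbor distance would make the
    ratio undefined).
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]

    # Dense pairwise Euclidean distances (O(n^2) memory; fine for small n).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Exclude each point's zero distance to itself.
    np.fill_diagonal(dist, np.inf)

    # The two smallest distances in each row: r1 <= r2.
    two_smallest = np.partition(dist, 1, axis=1)[:, :2]
    r1, r2 = two_smallest[:, 0], two_smallest[:, 1]

    # Under the TWO-NN assumptions, mu = r2 / r1 follows a Pareto(1, d) law,
    # whose maximum-likelihood estimate of d is n / sum(log(mu)).
    mu = r2 / r1
    return n / np.sum(np.log(mu))


# Example: points on a 2-dimensional plane embedded in 5 dimensions.
rng = np.random.default_rng(0)
Z = rng.uniform(size=(800, 2))   # latent 2-D coordinates
A = rng.normal(size=(2, 5))      # linear embedding into 5-D
print(twonn_mle(Z @ A))          # expected to be close to 2
```

The estimate depends only on distance ratios, so it is unaffected by a global rescaling of the data. This is a frequentist point estimate only; the Bayesian treatment and the heterogeneous Hidalgo model described in the abstract are not covered by this sketch.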