This article illustrates intRinsic, an R package that implements novel state-of-the-art likelihood-based estimators of the intrinsic dimension of a dataset, an essential quantity for most dimensionality reduction techniques. In order to make these novel estimators easily accessible, the package contains a small number of high-level functions that rely on a broader set of efficient, low-level routines. Generally speaking, intRinsic encompasses models that fall into two categories: homogeneous and heterogeneous intrinsic dimension estimators. The first category contains the two nearest neighbors estimator, a method derived from the distributional properties of the ratios of the distances between each data point and its first two closest neighbors. The functions dedicated to this method carry out inference under both the frequentist and Bayesian frameworks. In the second category, we find the heterogeneous intrinsic dimension algorithm, a Bayesian mixture model for which an efficient Gibbs sampler is implemented. After presenting the theoretical background, we demonstrate the performance of the models on simulated datasets. This way, we can facilitate the exposition by immediately assessing the validity of the results. Then, we employ the package to study the intrinsic dimension of the Alon dataset, obtained from a famous microarray experiment. Finally, we show how the estimation of homogeneous and heterogeneous intrinsic dimensions allows us to gain valuable insights into the topological structure of a dataset.