This paper addresses the case where data come as point sets, or more generally as discrete measures. Our motivation is twofold. First, we aim to approximate, with a compactly supported measure, the mean of the measure-generating process, which coincides with the intensity measure in the point process framework and with the expected persistence diagram in persistence-based topological data analysis. To this end, we provide two algorithms that we prove to be almost minimax optimal. Second, we build from the estimator of the mean measure a vectorization map that sends every measure into a finite-dimensional Euclidean space, and we investigate its properties through a clustering-oriented lens. In a nutshell, we show that for a mixture of measure-generating processes, our technique yields a representation in $\mathbb{R}^k$, for $k \in \mathbb{N}^*$, that guarantees a good clustering of the data points with high probability. Interestingly, our results apply in the framework of persistence-based shape classification via the ATOL procedure described in \cite{Royer19}.