We study the sublinear multivariate mean estimation problem in $d$-dimensional Euclidean space. Specifically, we aim to find the mean $\mu$ of a ground point set $A$, which minimizes the sum of squared Euclidean distances of the points in $A$ to $\mu$. We first show that a multiplicative $(1+\varepsilon)$ approximation to $\mu$ can be found with probability $1-\delta$ using $O(\varepsilon^{-1}\log \delta^{-1})$ many independent uniform random samples, and provide a matching lower bound. Furthermore, we give two sublinear time algorithms of optimal sample complexity for extracting a suitable approximate mean: 1. Our first algorithm is based on gradient descent and exploits properties of the geometric median to estimate the mean. It runs in time $O((\varepsilon^{-1}+\log \delta^{-1})\cdot \log \delta^{-1} \cdot d)$. 2. Our second algorithm leverages properties of empirical means order statistics as well as clustering to estimate the mean. This allows to decrease the running time to near-optimal, namely $O\left((\varepsilon^{-1}+\log^{\gamma}\delta^{-1})\cdot \log \delta^{-1} \cdot d\right)$ for any constant $\gamma>0$. Throughout our analysis, we also generalize the familiar median-of-means estimator to the multivariate case, showing that the geometric median of $\log \delta^{-1}$ empirical means well-estimates the mean $\mu$, which may be of independent interest.
翻译:暂无翻译