We study the problem of constructing coresets for clustering problems with time series data. This problem has gained importance across many fields, including biology, medicine, and economics, due to the proliferation of sensors enabling real-time measurement and the rapid drop in storage costs. In particular, we consider the setting where the time series data on $N$ entities is generated from a Gaussian mixture model with autocorrelations over $k$ clusters in $\mathbb{R}^d$. Our main contribution is an algorithm to construct coresets for the maximum likelihood objective for this mixture model. Our algorithm is efficient, and under a mild boundedness assumption on the covariance matrices of the underlying Gaussians, the size of the coreset is independent of the number of entities $N$ and the number of observations per entity, and depends only polynomially on $k$, $d$, and $1/\varepsilon$, where $\varepsilon$ is the error parameter. We empirically assess the performance of our coreset on synthetic data.
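To make the generative setting concrete, the following is a minimal sketch, not the paper's implementation, of drawing synthetic data of this form. It assumes an AR(1) autocorrelation structure and illustrative parameter values (neither is specified in the abstract): each of the $N$ entities is assigned to one of $k$ Gaussian clusters, and its $T$ observations in $\mathbb{R}^d$ are the cluster mean plus autocorrelated Gaussian noise.

```python
# Minimal sketch of the generative model: a k-component Gaussian mixture
# over entities, with AR(1)-autocorrelated noise within each time series.
# All parameter values (N, T, k, d, rho, the cluster means) are illustrative.
import numpy as np

rng = np.random.default_rng(0)
N, T, k, d = 100, 50, 3, 2   # entities, observations per entity, clusters, dimension
rho = 0.6                    # hypothetical AR(1) autocorrelation coefficient

means = rng.normal(scale=5.0, size=(k, d))    # cluster means in R^d
weights = np.full(k, 1.0 / k)                 # uniform mixing weights
labels = rng.choice(k, size=N, p=weights)     # cluster assignment per entity

X = np.empty((N, T, d))
for i in range(N):
    noise = np.empty((T, d))
    noise[0] = rng.normal(size=d)
    for t in range(1, T):
        # Stationary AR(1) noise with unit marginal variance.
        noise[t] = rho * noise[t - 1] + np.sqrt(1 - rho**2) * rng.normal(size=d)
    X[i] = means[labels[i]] + noise   # broadcast cluster mean over all T steps
```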