We develop a novel clustering method for distributional data, where each data point is regarded as a probability distribution on the real line. For distributional data, it has been challenging to develop a clustering method that utilizes the mode of variation of data because the space of probability distributions lacks a vector space structure, preventing the application of existing methods for functional data. In this study, we propose a novel clustering method for distributional data on the real line, which takes account of difference in both the mean and mode of variation structures of clusters, in the spirit of the $k$-centres clustering approach proposed for functional data. Specifically, we consider the space of distributions equipped with the Wasserstein metric and define a geodesic mode of variation of distributional data using geodesic principal component analysis. Then, we utilize the geodesic mode of each cluster to predict the cluster membership of each distribution. We theoretically show the validity of the proposed clustering criterion by studying the probability of correct membership. Through a simulation study and real data application, we demonstrate that the proposed distributional clustering method can improve cluster quality compared to conventional clustering algorithms.
翻译:暂无翻译