Change-point analysis is thriving in the big-data era, addressing problems in the many fields where massive data sequences are collected to study complicated phenomena over time. It plays an important role in processing these data by segmenting a long sequence into homogeneous parts for follow-up studies. The task requires a method that can process large datasets quickly and handle various types of changes in high-dimensional data. We propose a new approach that uses approximate $k$-nearest neighbor information from the observations, and we derive an analytic formula to control the type I error. The time complexity of the proposed method is $O(dn\log n+nk^2)$ for a length-$n$ sequence of $d$-dimensional data. The test statistic incorporates a useful pattern for moderate- to high-dimensional data, so the proposed method can detect various types of changes in the sequence. The new approach is also asymptotically distribution-free, which facilitates its use by a broader community. We apply our method to an fMRI dataset and a Neuropixels dataset to illustrate its effectiveness.
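To make the idea of using nearest-neighbor information for change-point detection concrete, here is a minimal sketch. It is not the statistic proposed in this work. It makes the following assumptions: it builds an exact directed $k$-nearest-neighbor graph by brute force (costing $O(dn^2)$ rather than the approximate search behind the stated $O(dn\log n+nk^2)$ bound), it counts the edges that connect observations before and after each candidate split, and it compares that count with its expectation under a random permutation of the time order. A split crossed by far fewer edges than expected is a candidate change-point. The function names, the `margin` parameter, and the ratio criterion are all hypothetical choices for illustration, and the sketch provides no type I error control.

```python
def knn_edges(points, k):
    """Directed k-NN edges (i, j), found by brute force (illustrative only)."""
    n = len(points)
    edges = []
    for i in range(n):
        dists = []
        for j in range(n):
            if j != i:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                dists.append((d2, j))
        dists.sort()
        edges.extend((i, j) for _, j in dists[:k])
    return edges


def cross_edge_counts(edges, n):
    """For each split t in 1..n-1, count edges joining {0..t-1} and {t..n-1}."""
    # An edge (i, j) crosses split t exactly when min(i, j) < t <= max(i, j),
    # so a difference array gives all counts in O(n + number of edges).
    diff = [0] * (n + 1)
    for i, j in edges:
        lo, hi = min(i, j), max(i, j)
        diff[lo + 1] += 1
        diff[hi + 1] -= 1
    counts, running = {}, 0
    for t in range(1, n):
        running += diff[t]
        counts[t] = running
    return counts


def estimate_change_point(points, k=3, margin=0.1):
    """Return the split whose observed/expected cross-edge ratio is smallest."""
    n = len(points)
    edges = knn_edges(points, k)
    counts = cross_edge_counts(edges, n)
    # Skip splits near the ends, where the unstandardized ratio is unstable.
    lo = max(1, int(margin * n))
    hi = min(n - 1, n - int(margin * n))

    def ratio(t):
        # Under a random permutation of the time order, a fixed pair of
        # observations is separated by split t with probability
        # 2 t (n - t) / (n (n - 1)).
        expected = len(edges) * 2.0 * t * (n - t) / (n * (n - 1))
        return counts[t] / expected

    return min(range(lo, hi + 1), key=ratio)
```

Calling `estimate_change_point(points, k=3)` on a list of equal-length numeric tuples returns an index `t`, meaning that the estimated change lies between observations `t-1` and `t` (zero-based). A practical method would also standardize the count by its variance and calibrate a significance threshold, which this sketch omits.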