In many applications, it is of practical and scientific interest to detect anomalous events in a streaming sequence of high-dimensional or non-Euclidean observations. We study a non-parametric framework that uses nearest-neighbor information among the observations to detect changes in an online setting. The framework applies to data of arbitrary dimension and to non-Euclidean data, as long as a similarity measure on the sample space can be defined. We consider new test statistics under this framework that detect anomalous events more effectively than the existing test while keeping the false discovery rate controlled at a fixed level. We derive analytic formulas that approximate the average run lengths of the new approaches, so that they can be applied quickly to modern datasets. Simulation studies support the theoretical results. The proposed approach is illustrated with an analysis of the NYC taxi dataset.
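To make the nearest-neighbor idea concrete, the following is a minimal sketch and not the test statistics studied here. It assumes a precomputed pairwise distance matrix over a window of recent observations, builds a directed k-nearest-neighbor graph, and counts, for each candidate split time, the edges that join observations on opposite sides of the split. An unusually small count suggests that the two sides come from different distributions. The function names `knn_indices` and `cross_edge_counts` and the choice `k=3` are hypothetical, and the standardization and stopping rule needed for an actual online procedure are omitted.

```python
import numpy as np


def knn_indices(dist, k):
    """Return an (n, k) array whose row i holds the indices of i's k nearest neighbors."""
    d = np.asarray(dist, dtype=float).copy()
    np.fill_diagonal(d, np.inf)  # a point is never its own neighbor
    return np.argsort(d, axis=1)[:, :k]


def cross_edge_counts(dist, k=3):
    """For each split t, count directed k-NN edges joining the first t points to the rest.

    Illustrative only: the raw count is not standardized, so it is not
    a calibrated test statistic.
    """
    n = np.asarray(dist).shape[0]
    nn = knn_indices(dist, k)
    counts = {}
    for t in range(1, n):
        before = np.arange(n) < t  # True for observations 0..t-1
        # Edge i -> j crosses the split when i and j lie on opposite sides.
        crosses = before[:, None] != before[nn]
        counts[t] = int(crosses.sum())
    return counts


# Usage: any dissimilarity works; Euclidean distance is used here for illustration.
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0, 1, (10, 5)), rng.normal(50, 1, (10, 5))])
dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
counts = cross_edge_counts(dist, k=3)
```

Only the distance matrix enters the computation, which is why the same sketch would work for non-Euclidean data given a suitable dissimilarity.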