This paper considers the detection of change points in parallel data streams, a problem widely encountered when analyzing large-scale real-time streaming data. Each stream may have its own change point, at which the distribution of its data changes. With sequentially observed data, a decision maker needs to declare, at each time point, which streams have already undergone a change. Once a stream is declared to have changed, it is deactivated permanently so that its future data will no longer be collected. This is a compound decision problem in the sense that the decision maker may want to optimize certain compound performance metrics that concern all the streams as a whole. Thus, the decisions for different streams are not independent. Our contribution is threefold. First, we propose a general framework for compound performance metrics that includes those considered in existing works as special cases and introduces new ones that connect closely with the performance metrics for single-stream sequential change detection and large-scale hypothesis testing. Second, data-driven decision procedures are developed under this framework. Finally, optimality results are established for the proposed decision procedures. The proposed methods and theory are evaluated by simulation studies and a case study.
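To make the problem setup concrete, the following is a minimal sketch of parallel-stream monitoring with permanent deactivation. It is not the compound procedure proposed in the paper: it runs an independent, textbook CUSUM detector on each stream, and so ignores the compound performance metrics entirely. The Gaussian mean-shift model (N(0,1) before the change, N(1,1) after), the threshold value, and the function names `cusum_update`, `monitor_streams`, and `simulate` are all assumptions introduced for illustration.

```python
import random


def cusum_update(stat, x, mu0=0.0, mu1=1.0, sigma=1.0):
    """One CUSUM step for an assumed Gaussian mean shift mu0 -> mu1."""
    # Log-likelihood ratio of the post-change vs. pre-change density at x.
    llr = (mu1 - mu0) * (x - (mu0 + mu1) / 2.0) / sigma**2
    # CUSUM recursion: accumulate evidence, resetting at zero.
    return max(0.0, stat + llr)


def monitor_streams(data, threshold=5.0):
    """Run independent CUSUM detectors over parallel streams.

    data[t][k] is the observation of stream k at time t. Returns a list whose
    k-th entry is the 0-based time at which stream k was declared changed and
    deactivated, or None if it was never flagged.
    """
    n_streams = len(data[0])
    stats = [0.0] * n_streams
    stopped_at = [None] * n_streams
    for t, row in enumerate(data):
        for k in range(n_streams):
            if stopped_at[k] is not None:
                continue  # deactivated streams are no longer observed
            stats[k] = cusum_update(stats[k], row[k])
            if stats[k] >= threshold:  # hypothetical fixed threshold
                stopped_at[k] = t
    return stopped_at


def simulate(change_points, horizon, seed=0):
    """Simulate N(0,1) streams that shift to N(1,1) at their change points.

    change_points[k] is the first post-change time index of stream k;
    None means the stream never changes.
    """
    rng = random.Random(seed)
    return [
        [rng.gauss(1.0 if cp is not None and t >= cp else 0.0, 1.0)
         for cp in change_points]
        for t in range(horizon)
    ]


if __name__ == "__main__":
    data = simulate(change_points=[None, 20, 50], horizon=200)
    print(monitor_streams(data))
```

Because each stream here is thresholded on its own, the decisions are independent across streams. A compound procedure of the kind the abstract describes would instead couple the stopping decisions across streams in order to control a metric defined over all streams as a whole.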