Detecting anomalies in large sets of observations is crucial in applications such as epidemiological studies, gene expression studies, and systems monitoring. We consider settings where each unit of interest yields multiple independent observations from potentially distinct referentials. Scan statistics and related methods are commonly used in such settings, but they rely on stringent modeling assumptions for proper calibration. We instead propose a rank-based variant of the higher criticism statistic that only requires independent observations originating from ordered spaces. We characterize the conditions under which the resulting methodology detects the presence of anomalies. These conditions are stated in a general, non-parametric manner and depend solely on the probabilities of anomalous observations exceeding nominal observations. The analysis requires a refined understanding of the distribution of the ranks in the presence of anomalies, and in particular of the rank-induced dependencies. Because it uses ranks, the methodology is robust against heavy-tailed distributions. Within the exponential family and a family of convolutional models, we analytically quantify the asymptotic performance of our methodology and that of the oracle, and show that the difference is small for many common models. Simulations confirm these results. We demonstrate the applicability of the methodology through an analysis of quality control data from a pharmaceutical manufacturing process.
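To make the idea of a rank-based higher criticism test concrete, the following is a minimal sketch and not the statistic analyzed in the paper. It assumes the data form an n-by-t array with one row per unit and one column per referential, ranks observations within each column, scores each unit by its rank sum, and feeds normal-approximation p-values into the classical higher criticism statistic. All function names (`column_ranks`, `rank_sum_pvalues`, `higher_criticism`, `rank_hc_test`) are hypothetical, as are the rank-sum scoring and the restriction of the maximum to the smaller half of the p-values. Calibration uses Monte Carlo draws of independent uniform permutations per column, which matches the null distribution of the ranks when observations within a referential are exchangeable and free of ties, so the rank-induced dependence between units is reflected in the reference distribution.

```python
import math
import numpy as np


def column_ranks(x):
    """Rank entries within each column (1 = smallest); assumes no ties."""
    return np.argsort(np.argsort(x, axis=0), axis=0) + 1


def rank_sum_pvalues(ranks):
    """Upper-tail normal-approximation p-value of each unit's rank sum.

    Under the null, each rank is uniform on {1, ..., n} and the columns are
    independent, so a rank sum has mean t(n+1)/2 and variance t(n^2-1)/12.
    """
    n, t = ranks.shape
    mean = t * (n + 1) / 2.0
    sd = math.sqrt(t * (n * n - 1) / 12.0)
    z = (ranks.sum(axis=1) - mean) / sd
    # P(Z >= z) for a standard normal Z equals 0.5 * erfc(z / sqrt(2)).
    return np.array([0.5 * math.erfc(v / math.sqrt(2.0)) for v in z])


def higher_criticism(pvals):
    """Classical higher criticism statistic, maximized over the smaller half
    of the sorted p-values (an assumed, commonly used restriction)."""
    p = np.clip(np.sort(pvals), 1e-12, 1 - 1e-12)
    n = p.size
    frac = np.arange(1, n + 1) / n
    hc = math.sqrt(n) * (frac - p) / np.sqrt(p * (1 - p))
    return float(hc[: max(1, n // 2)].max())


def rank_hc_test(x, n_null=200, seed=0):
    """Return (observed statistic, Monte Carlo p-value) for the array x."""
    rng = np.random.default_rng(seed)
    n, t = x.shape
    observed = higher_criticism(rank_sum_pvalues(column_ranks(x)))
    exceed = 0
    for _ in range(n_null):
        # Null ranks: an independent uniform permutation of 1..n per column.
        null_ranks = np.column_stack(
            [rng.permutation(n) + 1 for _ in range(t)]
        )
        exceed += higher_criticism(rank_sum_pvalues(null_ranks)) >= observed
    return observed, (1 + exceed) / (1 + n_null)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    data = rng.standard_cauchy(size=(200, 10))  # heavy-tailed nominal data
    data[:10] += 5.0                            # ten anomalous units, shifted up
    print(rank_hc_test(data))
```

Because only the within-column ranks enter the statistic, rescaling or monotonically transforming any single column leaves the result unchanged, which is why columns from different referentials can be combined and why heavy tails do not affect calibration in this sketch.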