Real-Time Regression Analysis of Streaming Clustered Data With Possible Abnormal Data Batches

This paper develops an incremental learning algorithm based on the quadratic inference function (QIF) to analyze streaming datasets with correlated outcomes, such as longitudinal data and clustered data. We propose a renewable QIF (RenewQIF) method within a paradigm of renewable estimation and incremental inference, in which parameter estimates are recursively renewed using the current data batch and summary statistics of historical data, without accessing any historical subject-level raw data. We compare our renewable estimation method with the offline QIF and offline generalized estimating equations (GEE) approaches, both of which process the entire cumulative subject-level data, and show theoretically and numerically that our renewable procedure enjoys statistical and computational efficiency. We also propose an approach to check the homogeneity assumption on the regression coefficients via a sequential goodness-of-fit test, which serves as a screening procedure for occurrences of abnormal data batches. We implement the proposed methodology by extending Spark's existing Lambda architecture to support statistical inference and data quality diagnosis. We illustrate the proposed methodology through extensive simulation studies and an analysis of streaming car crash datasets from the National Automotive Sampling System-Crashworthiness Data System (NASS CDS).
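The renewable-estimation idea can be illustrated with a deliberately simplified sketch. The code below is not RenewQIF: it assumes an ordinary linear model with independent errors and a single covariate, where the summary statistics are plain running sums. The class name `RenewableSimpleRegression` and the helper `offline_fit` are hypothetical names introduced only for this illustration. What it shows is the general pattern described in the abstract: each batch updates stored summaries, the raw batch is then discarded, and the renewed estimate agrees with an offline fit on the cumulative data.

```python
import random


class RenewableSimpleRegression:
    """Toy renewable estimator for y = a + b*x (illustration only, not RenewQIF)."""

    def __init__(self):
        # Summary statistics of all historical batches; no raw data is kept.
        self.n = 0
        self.sx = self.sy = self.sxx = self.sxy = 0.0
        self.intercept = 0.0
        self.slope = 0.0

    def update(self, xs, ys):
        """Renew the estimate using the current batch plus stored summaries."""
        self.n += len(xs)
        self.sx += sum(xs)
        self.sy += sum(ys)
        self.sxx += sum(x * x for x in xs)
        self.sxy += sum(x * y for x, y in zip(xs, ys))
        denom = self.n * self.sxx - self.sx ** 2
        self.slope = (self.n * self.sxy - self.sx * self.sy) / denom
        self.intercept = (self.sy - self.slope * self.sx) / self.n
        return self.intercept, self.slope


def offline_fit(xs, ys):
    """Least-squares fit that needs the full cumulative data in memory."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return my - slope * mx, slope


random.seed(0)
model = RenewableSimpleRegression()
all_x, all_y = [], []  # kept here only to run the offline comparison
for _ in range(5):  # five simulated data batches
    xs = [random.gauss(0.0, 1.0) for _ in range(200)]
    ys = [1.0 + 2.0 * x + random.gauss(0.0, 0.1) for x in xs]
    model.update(xs, ys)
    all_x.extend(xs)
    all_y.extend(ys)

offline_intercept, offline_slope = offline_fit(all_x, all_y)
```

In this linear special case the renewed and offline estimates coincide up to floating-point error. For correlated outcomes and nonlinear models, the paper's method replaces these running sums with QIF-based summary statistics, which this sketch does not attempt to reproduce.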
