Dynamic statistical inference in massive datastreams

Modern technological advances have expanded the range of applications that require analysis of large-scale datastreams comprising multiple indefinitely long time series. There is an acute need for statistical methodologies that perform online inference and continuously revise the model to reflect the current status of the underlying process. In this manuscript, we propose a dynamic statistical inference framework, named dynamic tracking and screening (DTS), that not only provides accurate estimates of the underlying parameters in a dynamic statistical model but also rapidly identifies irregular individual streams whose behavioral patterns deviate from the majority. Concretely, by fully exploiting the sequential nature of datastreams, we develop a robust estimation approach within a varying-coefficient model framework. The procedure naturally accommodates unequally spaced design points and updates the coefficient estimates as new data arrive, without the need to store historical data. A data-driven choice of the optimal smoothing parameter is proposed accordingly. Furthermore, we suggest a new multiple testing procedure tailored to the streaming environment. The resulting DTS scheme adapts to time-varying structures, tracks changes in the underlying models, and hence maintains high accuracy in detecting time periods during which individual streams exhibit irregular behavior. Moreover, we derive rigorous statistical guarantees for the procedure and investigate its finite-sample performance through simulation studies. We demonstrate the proposed methods on a mobile health example, estimating the times at which subjects' sleep and physical activity have an unusual influence on their mood.
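The abstract does not give the estimator itself, so the following is only a minimal sketch of the general idea of updating coefficient estimates from unequally spaced observations without storing past data. It assumes a simple model y = b0(t) + b1(t) x, ordinary (non-robust) least squares, and a one-sided exponential kernel with a fixed bandwidth; the class name `OnlineVaryingCoef` and all parameter names are hypothetical, and none of this should be read as the DTS procedure.

```python
import math


class OnlineVaryingCoef:
    """Exponentially weighted least squares for y = b0(t) + b1(t) * x.

    Illustrative assumption, not the DTS estimator: past observations are
    down-weighted by exp(-(t_now - t_i) / bandwidth), so only five running
    sums are kept and no historical data is stored.
    """

    def __init__(self, bandwidth):
        self.h = bandwidth
        self.t_last = None
        # Running weighted sums with z = (1, x):
        # S = sum w * z z^T (entries s00, s01, s11), r = sum w * z * y (r0, r1).
        self.s00 = self.s01 = self.s11 = 0.0
        self.r0 = self.r1 = 0.0

    def update(self, t, x, y):
        """Absorb one observation at time t and return the current estimate."""
        if self.t_last is not None:
            # The decay depends on the actual time gap, so unequal spacing
            # between design points is handled automatically.
            decay = math.exp(-(t - self.t_last) / self.h)
            self.s00 *= decay
            self.s01 *= decay
            self.s11 *= decay
            self.r0 *= decay
            self.r1 *= decay
        self.t_last = t
        self.s00 += 1.0
        self.s01 += x
        self.s11 += x * x
        self.r0 += y
        self.r1 += x * y
        return self.estimate()

    def estimate(self):
        """Solve the 2x2 weighted normal equations; None if they are singular."""
        det = self.s00 * self.s11 - self.s01 ** 2
        if abs(det) < 1e-12:
            return None
        b0 = (self.s11 * self.r0 - self.s01 * self.r1) / det
        b1 = (self.s00 * self.r1 - self.s01 * self.r0) / det
        return b0, b1


if __name__ == "__main__":
    model = OnlineVaryingCoef(bandwidth=2.0)
    # Unequally spaced observation times; y follows 1 + 2x exactly.
    for t, x in [(0.0, 0.0), (0.3, 1.0), (1.1, 2.0), (1.2, -1.0), (2.5, 3.0)]:
        print(t, model.update(t, x, 1.0 + 2.0 * x))
```

In this sketch the bandwidth plays the role of a smoothing parameter: a small value tracks changes quickly but uses fewer effective observations, while a large value averages over a longer history. Here it is fixed by hand, whereas the abstract describes a data-driven choice.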
