The Cox model is an indispensable tool for time-to-event analysis, particularly in biomedical research. However, medicine is undergoing a profound transformation, generating data at an unprecedented scale, which opens new frontiers for studying and understanding disease. This wealth of data poses new challenges for statistical inference, as datasets are often high dimensional, exhibit a growing number of measurements at irregularly spaced time points, and are simply too large to fit in memory. Many current implementations for time-to-event analysis are ill-suited to these problems because inference is computationally demanding and requires access to the full data at once. Here we propose a Bayesian version of the counting process representation of Cox's partial likelihood for efficient inference on large-scale datasets with millions of data points and thousands of time-dependent covariates. By combining stochastic variational inference with a reweighting of the log-likelihood, we obtain an approximation of the posterior distribution that factorizes over subsamples of the data, enabling analysis in big data settings. Crucially, the method produces viable uncertainty estimates for large-scale and high-dimensional datasets. We demonstrate the utility of our method through a simulation study and an application to myocardial infarction in the UK Biobank.
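To make the subsampling step concrete, the following is a minimal sketch of the standard stochastic variational inference objective that the reweighting alludes to; the notation ($q_\phi$, $\beta$, $S$, $N$) is illustrative and not necessarily the paper's. Once the counting process representation decomposes the log-likelihood into per-observation terms $\log p(d_i \mid \beta)$, a subsample $S \subset \{1, \dots, N\}$ yields an unbiased estimate of the evidence lower bound:
\[
  \mathcal{L}(\phi) \;=\; \mathbb{E}_{q_\phi(\beta)}\!\left[ \frac{N}{|S|} \sum_{i \in S} \log p(d_i \mid \beta) \right] \;-\; \mathrm{KL}\!\left( q_\phi(\beta) \,\middle\|\, p(\beta) \right),
\]
where the factor $N/|S|$ reweights the subsampled log-likelihood so that its expectation matches the full-data sum, allowing stochastic gradient optimization without holding the full dataset in memory; the KL term regularizes the approximate posterior toward the prior and supplies the uncertainty estimates referred to above.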