We analyze a large database of de-identified Medicare Advantage claims from a single large US health insurance provider, where the number of individuals available for analysis is an order of magnitude larger than the number of potential covariates. This type of data, dubbed `tall data', often does not fit in memory, and estimating parameters with traditional Markov chain Monte Carlo (MCMC) methods is computationally infeasible. We show how divide-and-conquer MCMC, which splits the data into disjoint subsamples and runs an MCMC algorithm on each subsample in parallel before combining the results, can be used with a multivariate probit factor model. We then show how this approach can be applied to large medical datasets to provide insights into questions of interest to the medical community. We also conduct a simulation study comparing two posterior combination algorithms with a mean-field stochastic variational approach, showing that divide-and-conquer MCMC should be preferred over variational inference when estimating the latent correlation structure between binary responses is of primary interest.
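To make the divide-and-conquer workflow concrete, the following is a minimal Python sketch, not the paper's implementation: it assumes consensus Monte Carlo weighted averaging as the combination rule and substitutes a plain probit regression fit by random-walk Metropolis for the multivariate probit factor model. All function names, tuning constants, and the simulated data are illustrative assumptions.

```python
"""Illustrative divide-and-conquer MCMC sketch: split the data into K disjoint
shards, run an independent sampler on each shard against its subposterior
(shard likelihood times the prior raised to 1/K), then combine the shard draws
by consensus-style weighted averaging."""
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def probit_log_subposterior(beta, X, y, prior_sd, n_shards):
    """Log shard likelihood plus 1/n_shards of the N(0, prior_sd^2) log prior."""
    eta = X @ beta
    loglik = np.sum(y * norm.logcdf(eta) + (1 - y) * norm.logcdf(-eta))
    logprior = -0.5 * np.sum(beta ** 2) / prior_sd ** 2
    return loglik + logprior / n_shards

def random_walk_metropolis(X, y, n_draws, n_shards, prior_sd=10.0, step=0.02):
    """Small random-walk Metropolis sampler targeting one shard's subposterior."""
    p = X.shape[1]
    beta = np.zeros(p)
    curr = probit_log_subposterior(beta, X, y, prior_sd, n_shards)
    draws = np.empty((n_draws, p))
    for s in range(n_draws):
        prop = beta + step * rng.standard_normal(p)
        prop_lp = probit_log_subposterior(prop, X, y, prior_sd, n_shards)
        if np.log(rng.uniform()) < prop_lp - curr:   # Metropolis accept/reject
            beta, curr = prop, prop_lp
        draws[s] = beta
    return draws

def consensus_combine(shard_draws):
    """Weighted average of shard draws; weights are inverse subposterior covariances."""
    weights = [np.linalg.inv(np.cov(d, rowvar=False)) for d in shard_draws]
    total_w_inv = np.linalg.inv(sum(weights))
    stacked = np.stack(shard_draws)                        # (K, S, p)
    weighted = np.einsum("kij,ksj->ksi", np.stack(weights), stacked)
    return weighted.sum(axis=0) @ total_w_inv.T            # (S, p) combined draws

# Simulate "tall" binary-response data and split it into disjoint shards.
n, p, K = 20000, 3, 4
beta_true = np.array([0.5, -1.0, 0.25])
X = rng.standard_normal((n, p))
y = (X @ beta_true + rng.standard_normal(n) > 0).astype(float)

shards = np.array_split(rng.permutation(n), K)
# Each shard could be sampled on a separate worker; here they run serially.
shard_draws = [random_walk_metropolis(X[idx], y[idx], n_draws=2500, n_shards=K)
               for idx in shards]
combined = consensus_combine([d[500:] for d in shard_draws])  # drop burn-in
print("combined posterior mean estimate:", combined.mean(axis=0))
```

Raising the prior to the power 1/K in each subposterior is the usual device that makes the product of the K subposteriors proportional to the full-data posterior, which is what any combination rule, weighted averaging or otherwise, is trying to approximate.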