Divide-and-Conquer MCMC for Multivariate Binary Data

The analysis of large-scale medical claims data has the potential to improve quality of care by generating insights that can be used to create tailored medical programs. In particular, the multivariate probit model can be used to investigate the correlation between multiple binary responses of interest in such data, e.g. the presence of multiple chronic conditions. Bayesian modeling is well suited to such analyses because the posterior distribution provides automatic uncertainty quantification. A complicating factor is that large medical claims datasets often do not fit in memory, which makes estimating the posterior with traditional Markov chain Monte Carlo (MCMC) methods computationally infeasible. To address this challenge, we extend existing divide-and-conquer MCMC algorithms to the multivariate probit model and demonstrate, via simulation, that they should be preferred over mean-field variational inference when estimating the latent correlation structure between binary responses is of primary interest. We apply this algorithm to a large database of de-identified Medicare Advantage claims from a single large US health insurance provider, where we find medically meaningful groupings of common chronic conditions and assess the impact of the urban-rural health gap by identifying underutilized provider specialties in rural areas.
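
The general divide-and-conquer idea can be illustrated with a toy sketch: split the data into shards, sample a subposterior on each shard independently, then combine the draws. The example below is not the authors' algorithm. It assumes a much simpler model (a normal mean with known standard deviation and a flat prior) in place of the multivariate probit, and it uses the precision-weighted averaging rule known as consensus Monte Carlo as one well-known combination step, which may differ from the algorithms the abstract refers to. All function names are hypothetical.

```python
import random
import statistics


def split_into_shards(data, num_shards):
    """Partition the data into num_shards shards of roughly equal size."""
    return [data[k::num_shards] for k in range(num_shards)]


def sample_subposterior(shard, sigma, num_draws, rng):
    """Stand-in for running MCMC on a single shard (toy assumption).

    For a normal mean with known sigma and a flat prior, the subposterior is
    N(mean(shard), sigma^2 / len(shard)), so it can be sampled directly.
    With a proper prior, each shard would use the prior raised to 1/K.
    """
    center = statistics.fmean(shard)
    scale = sigma / len(shard) ** 0.5
    return [rng.gauss(center, scale) for _ in range(num_draws)]


def consensus_combine(draws_per_shard):
    """Combine draws across shards by precision-weighted averaging."""
    # Weight each shard by the inverse of its estimated subposterior variance.
    weights = [1.0 / statistics.variance(d) for d in draws_per_shard]
    total = sum(weights)
    num_draws = len(draws_per_shard[0])
    return [
        sum(w * d[s] for w, d in zip(weights, draws_per_shard)) / total
        for s in range(num_draws)
    ]


rng = random.Random(0)
sigma = 2.0
data = [rng.gauss(1.5, sigma) for _ in range(10_000)]

shards = split_into_shards(data, num_shards=10)
draws = [sample_subposterior(shard, sigma, 2_000, rng) for shard in shards]
combined = consensus_combine(draws)

# The combined draws should approximate the full-data posterior,
# which here is N(mean(data), sigma^2 / len(data)).
print(statistics.fmean(combined), statistics.stdev(combined))
```

In this toy setting each shard fits in memory and could be processed on a separate worker; only the draws need to be brought together. For a multivariate probit model, `sample_subposterior` would be replaced by a full MCMC sampler over regression coefficients and the latent correlation matrix, and combining draws of a correlation matrix requires more care than the scalar average shown here.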
