We develop an unsupervised probabilistic model for heterogeneous Electronic Health Record (EHR) data. Using a mixture model formulation, our approach directly models sequences of arbitrary length, such as medications and laboratory results. This allows subjects to be grouped while incorporating the dynamics underlying heterogeneous data types. The model consists of a layered set of latent variables that encode underlying structure in the data: the top layer represents subject subgroups, and the second layer represents unobserved states for sequences. We train this model on episodic data from subjects receiving medical care in the Kaiser Permanente Northern California integrated healthcare delivery system. The properties of the trained model yield novel insight into these complex and multifaceted data. In addition, we show how the model can be used to analyze sequences that contribute to the assessment of mortality likelihood.
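To make the two-layer structure concrete, the following is a minimal sketch, not the model described above. It assumes that each subgroup owns a discrete hidden Markov model over a single sequence type, which the abstract does not state; it only says the second layer holds unobserved states for sequences. All names (`sequence_log_likelihood`, `subgroup_posterior`, `hmm_a`, `hmm_b`) and all parameter values are hypothetical and chosen for illustration only.

```python
import numpy as np


def sequence_log_likelihood(obs, log_init, log_trans, log_emit):
    """Forward algorithm in log space for one discrete sequence of any length >= 1.

    obs:       1-D integer array of observed symbols
    log_init:  (S,)   log initial state probabilities
    log_trans: (S, S) log transition probabilities, indexed [from, to]
    log_emit:  (S, V) log emission probabilities, indexed [state, symbol]
    """
    alpha = log_init + log_emit[:, obs[0]]
    for symbol in obs[1:]:
        # Sum over the previous state, then weight by the emission of the new symbol.
        alpha = np.logaddexp.reduce(alpha[:, None] + log_trans, axis=0) + log_emit[:, symbol]
    return np.logaddexp.reduce(alpha)


def subgroup_posterior(obs, log_weights, hmms):
    """Posterior probability of each top-layer subgroup given one sequence."""
    log_joint = np.array([
        lw + sequence_log_likelihood(obs, *hmm) for lw, hmm in zip(log_weights, hmms)
    ])
    return np.exp(log_joint - np.logaddexp.reduce(log_joint))


# Hypothetical parameters: two subgroups, two hidden states, two symbols.
init = np.log([0.5, 0.5])
trans = np.log([[0.9, 0.1], [0.1, 0.9]])
hmm_a = (init, trans, np.log([[0.9, 0.1], [0.8, 0.2]]))  # emits mostly symbol 0
hmm_b = (init, trans, np.log([[0.1, 0.9], [0.2, 0.8]]))  # emits mostly symbol 1
log_weights = np.log([0.5, 0.5])

post = subgroup_posterior(np.array([0, 0, 0, 0]), log_weights, [hmm_a, hmm_b])
print(post)
```

Because the forward recursion runs once per observed symbol, the same functions handle sequences of any length. A fuller version would combine several sequence types per subject (for example medications and laboratory results) by summing their log-likelihoods within each subgroup before normalizing, under an assumption of conditional independence given the subgroup.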