We present a new non-negative matrix factorization model for $(0,1)$ bounded-support data based on the doubly non-central beta (DNCB) distribution, a generalization of the beta distribution. The expressiveness of the DNCB distribution is particularly useful for modeling DNA methylation datasets, which are typically highly dispersed and multi-modal; however, the model structure is sufficiently general that it can be adapted to many other domains where latent representations of $(0,1)$ bounded-support data are of interest. Although the DNCB distribution lacks a closed-form conjugate prior, several augmentations let us derive an efficient posterior inference algorithm composed entirely of analytic updates. Our model improves out-of-sample predictive performance on both real and synthetic DNA methylation datasets over state-of-the-art methods in bioinformatics. In addition, our model yields meaningful latent representations that accord with existing biological knowledge.
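As a rough illustration of why the DNCB distribution is more expressive than the beta, and of the kind of augmentation that makes inference tractable, the sketch below samples from a DNCB through its Poisson-mixture representation: draw two Poisson counts, then draw a beta variate whose shape parameters are shifted by those counts. The function name `sample_dncb`, the parameter names, and the convention that the Poisson rates equal the non-centrality parameters are assumptions made for illustration (some references use half the non-centrality parameter as the rate); this is not the authors' implementation.

```python
import numpy as np

def sample_dncb(a, b, lam1, lam2, size, rng):
    """Draw DNCB(a, b, lam1, lam2) samples via a Poisson mixture of betas.

    Assumed parameterization (hypothetical, for illustration):
        y1 ~ Poisson(lam1), y2 ~ Poisson(lam2), x ~ Beta(a + y1, b + y2).
    """
    y1 = rng.poisson(lam1, size=size)  # latent count shifting the first shape
    y2 = rng.poisson(lam2, size=size)  # latent count shifting the second shape
    # Conditioned on (y1, y2), x is an ordinary beta variate.
    return rng.beta(a + y1, b + y2)

rng = np.random.default_rng(0)
x = sample_dncb(a=1.0, b=1.0, lam1=3.0, lam2=1.0, size=10_000, rng=rng)
```

Setting both non-centrality parameters to zero recovers the ordinary beta distribution. Conditioning on the latent counts restores the beta form, which is one reason augmentations of this kind can lead to analytic updates.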