High-dimensional categorical data are routinely collected in the biomedical and social sciences. It is of great importance to build interpretable, parsimonious models that perform dimension reduction and uncover meaningful latent structures from such discrete data. Identifiability is a fundamental requirement for valid modeling and inference in such scenarios, yet it is challenging to establish when the latent structures are complex. In this article, we propose a class of identifiable multilayer (potentially deep) discrete latent structure models for discrete data, termed Bayesian pyramids. We establish the identifiability of Bayesian pyramids by developing novel, transparent conditions on the pyramid-shaped deep latent directed graph. The proposed identifiability conditions can ensure Bayesian posterior consistency under suitable priors. As an illustration, we consider the two-latent-layer model and propose a Bayesian shrinkage estimation approach. Simulation results for this model corroborate the identifiability and estimability of the model parameters. Applications of the methodology to DNA nucleotide sequence data uncover useful discrete latent features that are highly predictive of sequence types. The proposed framework provides a recipe for interpretable unsupervised learning of discrete data and can be a useful alternative to popular machine learning methods.