Three-way data structures, characterized by three entities (the units, the variables, and the occasions), are frequent in biological studies. In RNA sequencing, three-way data structures are obtained when high-throughput transcriptome sequencing data are collected for $n$ genes across $p$ conditions at $r$ occasions. Matrix variate distributions offer a natural way to model three-way data, and mixtures of matrix variate distributions can be used to cluster such data. Clustering of gene expression data is carried out as a means of discovering gene co-expression networks. In this work, a mixture of matrix variate Poisson-log normal distributions is proposed for clustering read counts from RNA sequencing. The matrix variate structure uses the information on the conditions and the occasions of the RNA sequencing dataset simultaneously, and it reduces the number of covariance parameters to be estimated. We propose three different frameworks for parameter estimation: a Markov chain Monte Carlo based approach, a variational Gaussian approximation based approach, and a hybrid approach. Various information criteria are used for model selection. The models are applied to both real and simulated data, and we demonstrate that the proposed approaches can recover the underlying cluster structure in both cases. In simulation studies where the true model parameters are known, our proposed approach shows good parameter recovery.
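To make the claim about fewer covariance parameters concrete, the following is a minimal sketch, not the authors' implementation. It assumes the standard matrix variate normal construction, in which each gene's $r \times p$ latent matrix has mean `M`, an $r \times r$ row (occasion) covariance `Phi`, and a $p \times p$ column (condition) covariance `Omega`, and the observed counts are Poisson with rate equal to the exponential of the latent value. The names `M`, `Phi`, `Omega`, and both function names are illustrative assumptions. Any library-size normalization and the mixture over clusters are omitted.

```python
import numpy as np


def covariance_param_counts(r, p):
    """Count free covariance parameters for an r x p observation.

    Returns (unstructured, matrix_variate). The matrix variate count is
    the sum over the two symmetric factors; fixing the scale
    indeterminacy of the Kronecker product would remove one more.
    """
    d = r * p
    unstructured = d * (d + 1) // 2
    matrix_variate = r * (r + 1) // 2 + p * (p + 1) // 2
    return unstructured, matrix_variate


def sample_counts(n, M, Phi, Omega, rng):
    """Draw n count matrices from one (assumed) component.

    Latent layer: theta = M + A Z B^T with Phi = A A^T, Omega = B B^T,
    so theta is matrix variate normal with row covariance Phi and
    column covariance Omega. Observed layer: Poisson(exp(theta)).
    """
    r, p = M.shape
    A = np.linalg.cholesky(Phi)      # r x r lower-triangular factor
    B = np.linalg.cholesky(Omega)    # p x p lower-triangular factor
    Z = rng.standard_normal((n, r, p))
    theta = M + A @ Z @ B.T          # broadcasts over the n units
    return rng.poisson(np.exp(theta))


rng = np.random.default_rng(0)
r, p = 3, 2                          # occasions, conditions (toy sizes)
M = np.full((r, p), 2.0)             # latent log-scale mean
Phi = 0.2 * np.eye(r) + 0.05         # valid covariance: diagonal plus constant
Omega = 0.3 * np.eye(p) + 0.1
Y = sample_counts(5, M, Phi, Omega, rng)
print(Y.shape)
print(covariance_param_counts(r, p))
```

With these toy sizes the unstructured covariance of the vectorized $rp$-dimensional observation has 21 free parameters, while the two-factor structure has 9. The gap widens quickly as $r$ and $p$ grow.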