Gaussian processes (GPs) are pervasive in functional data analysis, machine learning, and spatial statistics for modeling complex dependencies. Modern scientific data sets are typically heterogeneous and often contain multiple known discrete subgroups of samples. For example, in genomics applications, samples may be grouped according to tissue type or drug exposure. In the modeling process, it is desirable to leverage the similarity among groups while accounting for the differences between them. While a substantial literature exists for GPs over Euclidean domains $\mathbb{R}^p$, GPs on domains suitable for multi-group data remain less explored. Here, we develop a multi-group Gaussian process (MGGP), which we define on $\mathbb{R}^p\times \mathscr{C}$, where $\mathscr{C}$ is a finite set representing the group label. We provide general methods to construct valid (positive definite) covariance functions on this domain, and we describe algorithms for inference, estimation, and prediction. We perform simulation experiments and apply the MGGP to gene expression data to illustrate its behavior and its advantages in the joint modeling of continuous and categorical variables.
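To make the idea of a valid covariance function on $\mathbb{R}^p\times \mathscr{C}$ concrete, the sketch below builds one simple example: a separable (product) kernel that multiplies an RBF kernel on $\mathbb{R}^p$ by a positive semidefinite group-similarity matrix. This is a standard textbook construction chosen for illustration and is not necessarily the construction developed in the paper. The names `rbf`, `multigroup_kernel`, and `B`, the lengthscale, and the entries of `B` are all assumptions introduced here.

```python
import numpy as np

def rbf(X1, X2, lengthscale=1.0):
    """RBF (squared-exponential) kernel between rows of X1 and X2."""
    # Pairwise squared Euclidean distances, clipped at 0 to absorb round-off.
    d2 = (
        np.sum(X1**2, axis=1)[:, None]
        + np.sum(X2**2, axis=1)[None, :]
        - 2.0 * X1 @ X2.T
    )
    return np.exp(-0.5 * np.maximum(d2, 0.0) / lengthscale**2)

def multigroup_kernel(X1, c1, X2, c2, B, lengthscale=1.0):
    """Illustrative separable kernel on R^p x C (an assumption, not the paper's).

    k((x, c), (x', c')) = k_rbf(x, x') * B[c, c'],
    where B is a positive semidefinite matrix of between-group similarities.
    The elementwise product of two PSD Gram matrices is PSD (Schur product
    theorem), so the result is a valid covariance matrix.
    """
    return rbf(X1, X2, lengthscale) * B[np.ix_(c1, c2)]

# Hypothetical data: 30 points in R^2, each assigned to one of 3 groups.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
c = rng.integers(0, 3, size=30)

# Hypothetical group-similarity matrix: groups 0 and 1 are similar,
# group 2 is less related to both. It is symmetric positive definite.
B = np.array([
    [1.0, 0.8, 0.3],
    [0.8, 1.0, 0.3],
    [0.3, 0.3, 1.0],
])

K = multigroup_kernel(X, c, X, c, B)

# Smallest eigenvalue of the Gram matrix; it should be nonnegative
# up to floating-point error.
min_eig = np.linalg.eigvalsh(K).min()
```

With this kind of kernel, observations from similar groups share information through the large off-diagonal entries of `B`, while a smaller entry weakens the coupling between less related groups.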