In many applications, data are heterogeneous in the sense that they span latent groups with different underlying distributions. When predictive models are applied to such data, the heterogeneity can affect both predictive performance and interpretability. Building on developments at the intersection of unsupervised learning and regularised regression, we propose an approach for heterogeneous data that allows joint learning of (i) explicit multivariate feature distributions, (ii) high-dimensional regression models and (iii) latent group labels, with both (i) and (ii) specific to latent groups and both informing (iii). The approach is demonstrably effective in high dimensions, combining data reduction for computational efficiency with a re-weighting scheme that retains key signals even when the number of features is large. We discuss these aspects in detail, together with their impact on modelling and computation, including EM convergence. The approach is modular and allows incorporation of data reductions and high-dimensional estimators that are suitable for specific applications. We show results from extensive simulations and real data experiments, including highly non-Gaussian data. Our results allow efficient, effective analysis of high-dimensional data in settings, such as biomedicine, where both interpretable prediction and explicit feature space models are needed but hidden heterogeneity may be a concern.
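To make the joint-learning idea concrete, the following is a minimal sketch of an EM loop in which each latent group has its own feature distribution and its own regularised regression, and both contribute to the group responsibilities. It is an illustration under simplifying assumptions, not the estimator used in this work: the function name `fit_joint_mixture`, the diagonal Gaussian feature model, the ridge penalty and the principal-component initialisation are all choices made for the example, and the data reduction and re-weighting scheme mentioned above are omitted.

```python
import numpy as np

def fit_joint_mixture(X, y, n_groups=2, ridge=1.0, n_iter=50):
    """Toy EM for a mixture with group-specific features and regressions.

    Assumptions (illustrative only): diagonal Gaussian features,
    ridge-penalised linear regression with Gaussian noise in each group.
    """
    n, p = X.shape
    # Crude deterministic initialisation: sort samples by their score on
    # the first principal component and cut into n_groups equal chunks.
    Xc0 = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc0, full_matrices=False)
    order = np.argsort(Xc0 @ Vt[0])
    labels = np.empty(n, dtype=int)
    labels[order] = np.arange(n) * n_groups // n
    R = np.eye(n_groups)[labels]  # (n, K) responsibilities

    for _ in range(n_iter):
        log_post = np.empty((n, n_groups))
        params = []
        for k in range(n_groups):
            w = R[:, k]
            nk = w.sum() + 1e-12
            # M-step (i): weighted feature mean and per-feature variance.
            mu = (w[:, None] * X).sum(axis=0) / nk
            var = (w[:, None] * (X - mu) ** 2).sum(axis=0) / nk + 1e-6
            # M-step (ii): weighted ridge regression on centred data.
            Xc = X - mu
            ybar = (w * y).sum() / nk
            A = Xc.T @ (w[:, None] * Xc) + ridge * np.eye(p)
            beta = np.linalg.solve(A, Xc.T @ (w * (y - ybar)))
            resid = y - ybar - Xc @ beta
            s2 = (w * resid ** 2).sum() / nk + 1e-6
            params.append({"weight": nk / n, "mu": mu, "var": var,
                           "intercept": ybar - mu @ beta, "beta": beta,
                           "noise_var": s2})
            # E-step (iii): features and response both inform the labels.
            log_px = -0.5 * (np.log(2 * np.pi * var) + (X - mu) ** 2 / var).sum(axis=1)
            log_py = -0.5 * (np.log(2 * np.pi * s2) + resid ** 2 / s2)
            log_post[:, k] = np.log(nk / n) + log_px + log_py
        # Normalise in log space for numerical stability.
        log_post -= log_post.max(axis=1, keepdims=True)
        R = np.exp(log_post)
        R /= R.sum(axis=1, keepdims=True)
    return R, params
```

Calling `R, params = fit_joint_mixture(X, y, n_groups=2)` returns soft group assignments in `R` and the group-specific parameters from the final M-step in `params`. In keeping with the modular design described above, the diagonal Gaussian and ridge steps could in principle be swapped for other feature models and sparse estimators.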