Regularized regression models are well studied and, under appropriate conditions, offer fast and statistically interpretable results. However, large data in many applications are heterogeneous in the sense of harboring distributional differences between latent groups. Then, the assumption that the conditional distribution of response Y given features X is the same for all samples may not hold. Furthermore, in scientific applications, the covariance structure of the features may contain important signals and its learning is also affected by latent group structure. We propose a class of mixture models for paired data (X, Y) that couples together the distribution of X (using sparse graphical models) and the conditional Y | X (using sparse regression models). The regression and graphical models are specific to the latent groups and model parameters are estimated jointly (hence the name "regularized joint mixtures"). This allows signals in either or both of the feature distribution and regression model to inform learning of latent structure and provides automatic control of confounding by such structure. Estimation is handled via an expectation-maximization algorithm, whose convergence is established theoretically. We illustrate the key ideas via empirical examples. An R package is available at https://github.com/k-perrakis/regjmix.
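To make the coupling of the feature distribution and the regression model concrete, the E-step of such a mixture can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the regjmix implementation: it assumes Gaussian features and a Gaussian linear regression within each group, it takes the group-specific parameters as given, and it omits the sparsity penalties that would enter the M-step. All function and argument names (`log_gaussian`, `e_step`, `weights`, `means`, `covs`, `intercepts`, `coefs`, `noise_vars`) are hypothetical.

```python
import numpy as np


def log_gaussian(X, mean, cov):
    """Row-wise log density of N(mean, cov) evaluated at the rows of X."""
    p = X.shape[1]
    diff = X - mean
    # slogdet avoids overflow/underflow in the determinant.
    _, logdet = np.linalg.slogdet(cov)
    # Squared Mahalanobis distance for every row, without forming cov^{-1}.
    maha = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
    return -0.5 * (p * np.log(2.0 * np.pi) + logdet + maha)


def e_step(X, y, weights, means, covs, intercepts, coefs, noise_vars):
    """Posterior group probabilities under an assumed Gaussian joint mixture.

    For each group k the joint density of (x, y) is taken to be
        weights[k] * N(x | means[k], covs[k])
                   * N(y | intercepts[k] + x @ coefs[k], noise_vars[k]),
    so both the feature distribution and the regression fit inform
    the group assignment.
    """
    n, K = X.shape[0], len(weights)
    log_joint = np.empty((n, K))
    for k in range(K):
        log_px = log_gaussian(X, means[k], covs[k])
        resid = y - (intercepts[k] + X @ coefs[k])
        log_py = -0.5 * (np.log(2.0 * np.pi * noise_vars[k])
                         + resid ** 2 / noise_vars[k])
        log_joint[:, k] = np.log(weights[k]) + log_px + log_py
    # Normalize across groups with the log-sum-exp trick for stability.
    m = log_joint.max(axis=1, keepdims=True)
    log_norm = m + np.log(np.exp(log_joint - m).sum(axis=1, keepdims=True))
    return np.exp(log_joint - log_norm)
```

Each row of the returned matrix sums to one and gives the probability that a sample belongs to each latent group. Because the feature term and the regression term are added on the log scale, a difference between groups in either one (or both) moves the assignment, which is the sense in which the two model components are estimated jointly.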