Researchers often have to deal with heterogeneous populations with mixed regression relationships, increasingly so in the era of data explosion. In such problems, when there are many candidate predictors, it is of interest not only to identify the predictors that are associated with the outcome, but also to distinguish the true sources of heterogeneity, i.e., to identify the predictors that have different effects among the clusters and thus are the true contributors to the formation of the clusters. We clarify the concept of a source of heterogeneity so that it accounts for potential scale differences among the clusters, and we propose a regularized finite mixture effects regression to achieve heterogeneity pursuit and feature selection simultaneously. As the name suggests, the problem is formulated under an effects-model parameterization, in which the cluster labels are missing and the effect of each predictor on the outcome is decomposed into a common effect term and a set of cluster-specific terms. A constrained sparse estimation of these effects leads to the identification of both the variables with common effects and those with heterogeneous effects. We propose an efficient algorithm and show that our approach can achieve both estimation and selection consistency. Simulation studies further demonstrate the effectiveness of our method under various practical scenarios. Three applications are presented, namely, an imaging genetics study linking genetic factors and brain neuroimaging traits in Alzheimer's disease, a public health study exploring the association between suicide risk among adolescents and their school district characteristics, and a sports analytics study examining how the salary levels of baseball players are associated with their performance and contractual status.
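The effects-model decomposition described above can be sketched as follows. This is an illustrative formulation only: the Gaussian components, the sum-to-zero constraint, and all symbols ($\pi_k$, $\sigma_k$, $\beta_{j0}$, $\beta_{jk}$, $K$, $p$) are assumptions introduced here, not notation taken from the paper, and the sketch leaves out the scale adjustment that the abstract mentions.

```latex
% Assumed setup: K latent clusters, p predictors x_1, ..., x_p, outcome y.
% Mixture density with unobserved cluster labels:
\[
  f(y \mid x) \;=\; \sum_{k=1}^{K} \pi_k \,
  \frac{1}{\sigma_k}\,
  \phi\!\left( \frac{y - \sum_{j=1}^{p} x_j\,(\beta_{j0} + \beta_{jk})}{\sigma_k} \right),
  \qquad \sum_{k=1}^{K} \pi_k = 1 .
\]
% Effects-model constraint that makes the common effect beta_{j0} identifiable:
\[
  \sum_{k=1}^{K} \beta_{jk} = 0 , \qquad j = 1, \dots, p .
\]
% Reading of the sparsity pattern for predictor j:
%   beta_{j0} = 0 and beta_{jk} = 0 for all k  : predictor j is irrelevant
%   beta_{j0} != 0 and beta_{jk} = 0 for all k : common effect only
%   beta_{jk} != 0 for some k                  : source of heterogeneity
```

Under this reading, a sparsity-inducing penalty on the $\beta_{j0}$ and $\beta_{jk}$ terms would separate feature selection (is predictor $j$ relevant at all?) from heterogeneity pursuit (does its effect differ across clusters?).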