In many categorical response regression applications, the response categories admit a multiresolution structure. That is, subsets of the response categories may naturally be combined into coarser response categories. In such applications, practitioners are often interested in estimating the resolution at which a predictor affects the response category probabilities. In this article, we propose a method for fitting the multinomial logistic regression model in high dimensions that addresses this problem in a unified and data-driven way. In particular, our method allows practitioners to identify which predictors distinguish between coarse categories but not fine categories, which predictors distinguish between fine categories, and which predictors are irrelevant. For model fitting, we propose a scalable algorithm that can be applied when the coarse categories are defined by either overlapping or nonoverlapping sets of fine categories. Statistical properties of our method reveal that it can take advantage of this multiresolution structure in a way existing estimators cannot. We use our method to model cell type probabilities as a function of a cell's gene expression profile (i.e., cell type annotation). Our fitted model provides novel biological insights that may be useful for future automated and manual cell type annotation methodology.
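To make the three kinds of predictors concrete, the following is a minimal sketch, not the proposed estimator or fitting algorithm. It assumes a plain multinomial logistic model with a hypothetical coefficient matrix `B`, four fine categories, and two nonoverlapping coarse groups; all names and values are illustrative. A predictor whose coefficients are constant within each coarse group shifts the coarse category probabilities but leaves the within-group conditional probabilities unchanged, a predictor whose coefficients vary within a group affects the fine categories, and a predictor with an all-zero row is irrelevant.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Assumed toy structure: fine categories 0..3, coarse groups {0, 1} and {2, 3}.
coarse_groups = [[0, 1], [2, 3]]

# Hypothetical coefficients: rows are predictors, columns are fine categories.
B = np.array([
    [1.0,  1.0, -1.0, -1.0],  # constant within each group: coarse-only effect
    [0.5, -0.5,  0.0,  0.0],  # varies within the first group: fine-level effect
    [0.0,  0.0,  0.0,  0.0],  # all zero: irrelevant predictor
])

def fine_probs(x):
    # Fine category probabilities under the multinomial logistic model.
    return softmax(x @ B)

def coarse_probs(x):
    # A coarse category probability is the sum over its fine categories.
    p = fine_probs(x)
    return np.array([p[g].sum() for g in coarse_groups])

def within_group_probs(x):
    # Conditional probabilities of fine categories given their coarse group.
    p = fine_probs(x)
    return [p[g] / p[g].sum() for g in coarse_groups]

x_base = np.array([0.0, 0.3, 1.0])
x_coarse = np.array([2.0, 0.3, 1.0])  # only predictor 0 changed
x_fine = np.array([0.0, 1.5, 1.0])    # only predictor 1 changed
x_null = np.array([0.0, 0.3, -4.0])   # only predictor 2 changed
```

Comparing `coarse_probs` and `within_group_probs` at `x_base` against the other three inputs shows which resolution each predictor acts on. In this toy setup the structure is read directly off the rows of `B`; the abstract's method instead aims to estimate such structure from data.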