Recently emerging large-scale biomedical data present exciting opportunities for scientific discovery. However, the ultrahigh dimensionality of the data and their non-negligible measurement errors can make estimation difficult. Few methods exist for high-dimensional covariates with measurement error, and those that do usually require knowledge of the noise distribution and focus on linear or generalized linear models. In this work, we develop high-dimensional measurement error models for a class of Lipschitz loss functions that encompasses logistic regression, hinge loss and quantile regression, among others. Our estimator is designed to minimize the $L_1$ norm among all estimators belonging to suitable feasible sets, without requiring any knowledge of the noise distribution. Subsequently, we generalize these estimators to a Lasso analog that scales computationally to higher dimensions. We derive theoretical guarantees in the form of finite-sample statistical error bounds and sign consistency, even when the dimensionality increases exponentially with the sample size. Extensive simulation studies demonstrate superior performance compared to existing methods in classification and quantile regression problems. An application to a gender classification task based on brain functional connectivity in the Human Connectome Project data illustrates improved accuracy under our approach, as well as the ability to reliably identify significant brain connections that drive gender differences.
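To make the phrase "Lipschitz loss functions" concrete, the following is a minimal sketch, not the paper's estimator, that writes down the three losses named above and checks numerically that each changes by at most one unit per unit change in the linear predictor. The function names, the label coding $y \in \{-1, +1\}$ for classification, and the quantile level `tau=0.9` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def logistic_loss(y, t):
    # Logistic loss log(1 + exp(-y * t)) for labels y in {-1, +1} (assumed coding),
    # evaluated stably via logaddexp.
    return np.logaddexp(0.0, -y * t)

def hinge_loss(y, t):
    # Hinge loss max(0, 1 - y * t) for labels y in {-1, +1} (assumed coding).
    return np.maximum(0.0, 1.0 - y * t)

def check_loss(y, t, tau=0.9):
    # Quantile (check) loss u * (tau - 1{u < 0}) with residual u = y - t.
    u = y - t
    return u * (tau - (u < 0))

def max_slope(loss, y, grid):
    # Largest absolute finite-difference slope of the loss in t over the grid;
    # a numerical stand-in for the Lipschitz constant in the linear predictor.
    vals = loss(y, grid)
    return float(np.max(np.abs(np.diff(vals) / np.diff(grid))))

grid = np.linspace(-10.0, 10.0, 2001)
slopes = {
    "logistic": max(max_slope(logistic_loss, y, grid) for y in (-1.0, 1.0)),
    "hinge": max(max_slope(hinge_loss, y, grid) for y in (-1.0, 1.0)),
    "quantile": max_slope(check_loss, 0.0, grid),
}
for name, s in slopes.items():
    print(f"{name}: max |slope| in t is about {s:.4f}")
```

Each reported slope stays at or below 1, which is the boundedness property that lets a single analysis cover all three losses; the feasible sets and the $L_1$-minimization step described in the abstract are not reproduced here.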