Modeling High-Dimensional Data with Unknown Cut Points: A Fusion Penalized Logistic Threshold Regression

In traditional logistic regression models, the link function is often assumed to be linear and continuous in the predictors. Here, we consider a threshold model in which all continuous features are discretized into ordinal levels, which in turn determine the binary responses. Both the threshold points and the regression coefficients are unknown and must be estimated. For high-dimensional data, we propose a fusion penalized logistic threshold regression (FILTER) model, in which a fused lasso penalty is employed to control the total variation and to shrink coefficients to zero as a means of variable selection. Under mild conditions on the estimate of the unknown threshold points, we establish a non-asymptotic error bound for coefficient estimation as well as model selection consistency. With a careful characterization of the error propagation, we also show that tree-based methods, such as CART, fulfill the threshold estimation conditions. We find that the FILTER model is well suited to the problem of early detection and prediction of chronic diseases such as diabetes, using physical examination data. The finite-sample behavior of the proposed method is also explored and compared in extensive Monte Carlo studies, which support our theoretical findings.
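To make the model structure concrete, the following is a minimal sketch of the two ingredients the abstract describes: discretizing continuous features at cut points, and a logistic loss with a fused-lasso-type penalty on the level coefficients. It is an illustration under stated assumptions, not the authors' implementation. The function names (`discretize`, `penalized_objective`), the choice of level 0 as a zero-effect baseline, and the exact form of the penalty are assumptions made here; the abstract does not specify them. The cut points are taken as given, whereas in the paper they are estimated (for example by a tree-based method such as CART).

```python
import numpy as np

def discretize(X, cuts):
    """Map each continuous column of X to ordinal levels 0..K_j.

    cuts[j] is an increasing list of cut points for feature j
    (assumed known here; the paper estimates them).
    """
    return np.column_stack(
        [np.digitize(X[:, j], cuts[j]) for j in range(X.shape[1])]
    )

def penalized_objective(intercept, betas, levels, y, lam):
    """Mean logistic loss plus a fused-lasso-type penalty (assumed form).

    betas[j][k-1] is the effect of level k >= 1 of feature j; level 0 is
    treated as a baseline with effect 0 (an assumption of this sketch).
    The penalty sums absolute differences between adjacent levels, so
    adjacent levels can be fused and a feature whose effects all equal
    the baseline is effectively dropped.
    """
    eta = np.full(levels.shape[0], float(intercept))
    penalty = 0.0
    for j, b in enumerate(betas):
        full = np.concatenate(([0.0], b))        # prepend the baseline effect
        eta += full[levels[:, j]]                # piecewise-constant effect
        penalty += np.abs(np.diff(full)).sum()   # total variation across levels
    # Negative log-likelihood of the logistic model, computed stably
    loss = np.mean(np.logaddexp(0.0, eta) - y * eta)
    return loss + lam * penalty
```

A usage example with two features, one cut point for the first and two for the second:

```python
X = np.array([[0.1, 5.0], [0.6, 1.0], [0.9, 3.0]])
levels = discretize(X, [[0.5], [2.0, 4.0]])
y = np.array([0, 1, 1])
betas = [np.array([1.0]), np.array([2.0, 2.0])]
value = penalized_objective(0.0, betas, levels, y, lam=0.1)
```

In practice the objective would be minimized over `intercept` and `betas` with a solver suited to non-smooth penalties; that step is omitted here.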
