Influential diagnosis is an integral part of data analysis, yet most existing methodological frameworks presume a deterministic submodel and are designed for low-dimensional data (i.e., the number of predictors p is smaller than the sample size n). However, the stochastic selection of a submodel from high-dimensional data, where p exceeds n, has become ubiquitous. Methods for identifying observations that could exert undue influence on the choice of a submodel can therefore play an important role in this setting. To date, discussion of this topic has been limited and falls short in two respects: (i) a constrained ability to detect multiple influential points, and (ii) applicability only in restrictive settings. After describing the problem, we characterize and formalize the concept of influential observations on variable selection. We then propose a generalized diagnostic measure, extended from an available metric, that accommodates different model selectors and multiple influential observations; its asymptotic distribution is subsequently established for large p, thus providing guidelines to ascertain influential observations. A high-dimensional clustering procedure is further incorporated into our proposed scheme to detect multiple influential points. Simulations are conducted to assess the performance of various diagnostic approaches. The proposed procedure further demonstrates its value in improving predictive power when analyzing thermal-stimulated pain based on fMRI data.