Outlying observations are frequently encountered in a wide spectrum of scientific domains, posing significant challenges for the generalizability of statistical models and the reproducibility of downstream analysis. These observations can be identified through influential diagnosis, which refers to the detection of observations that are unduly influential on diverse facets of statistical inference. To date, methods for identifying observations influencing the choice of a stochastically selected submodel have been underdeveloped, especially in the high-dimensional setting where the number of predictors p exceeds the sample size n. Recently we proposed an improved diagnostic measure to handle this setting. However, its distributional properties and approximations have not yet been explored. To address this shortcoming, the notion of exchangeability is revived, and used to determine the exact finite- and large-sample distributions of our assessment metric. This forms the foundation for the introduction of both parametric and non-parametric approaches for its approximation and the establishment of thresholds for diagnosis. The resulting framework is extended to logistic regression models, followed by a simulation study conducted to assess the performance of various detection procedures. Finally the framework is applied to data from an fMRI study of thermal pain, with the goal of identifying outlying subjects that could distort the formulation of statistical models using functional brain activity in predicting physical pain ratings. Both linear and logistic regression models are used to demonstrate the benefits of detection and compare the performances of different detection procedures. In particular, two additional influential observations are identified, which are not discovered by previous studies.
翻译:暂无翻译