The identification of influential observations is an important part of data analysis that can prevent erroneous conclusions drawn from biased estimators. However, in high-dimensional data, this identification is challenging. Classical and recently developed methods often perform poorly when there are multiple influential observations in the same dataset. In particular, current methods can fail under masking, when several influential observations with similar characteristics hide one another, or under swamping, when the influential observations are near the boundary of the space spanned by well-behaved observations. Therefore, we propose an algorithm-based, multi-step, multiple detection procedure to identify influential observations that addresses these limitations. Our three-step algorithm to identify and capture undesirable variability in the data, $\asymMIP$, is based on two complementary statistics, inspired by asymmetric correlations, and built on expectiles. Simulations demonstrate higher detection power than competing methods. Use of the resulting asymptotic distribution leads to detection of influential observations without the need for computationally demanding procedures such as the bootstrap. Applying our method to the Autism Brain Imaging Data Exchange neuroimaging dataset resulted in a more balanced and accurate prediction of brain maturity based on cortical thickness. A free R package that implements our algorithm, \texttt{asymMIP}, is available on GitHub (\url{github.com/AmBarry/hidetify}).
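As background on the expectiles mentioned in the abstract, the following is a minimal sketch of how a sample expectile can be computed. It is not the asymMIP procedure and does not reproduce the package's code; the function name `expectile` and its parameters are hypothetical, chosen only for illustration. The sketch relies on the standard characterization of the $\tau$-expectile as the minimizer of an asymmetrically weighted squared loss, which can be solved by iterating a weighted mean.

```python
def expectile(values, tau=0.5, tol=1e-10, max_iter=1000):
    """Sample tau-expectile via iteratively reweighted means.

    Illustrative helper (hypothetical name, not part of any package).
    The tau-expectile e minimizes sum_i w_i * (y_i - e)**2 with
    w_i = tau if y_i > e, else 1 - tau.
    """
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    ys = [float(v) for v in values]
    if not ys:
        raise ValueError("values must be non-empty")

    # Start from the sample mean, which is exactly the 0.5-expectile.
    e = sum(ys) / len(ys)
    for _ in range(max_iter):
        # Observations above the current estimate get weight tau,
        # the remaining observations get weight 1 - tau.
        weights = [tau if y > e else 1.0 - tau for y in ys]
        new_e = sum(w * y for w, y in zip(weights, ys)) / sum(weights)
        if abs(new_e - e) < tol:
            return new_e
        e = new_e
    return e


if __name__ == "__main__":
    data = [1, 2, 3, 4, 100]  # one extreme value pulls upper expectiles
    for tau in (0.1, 0.5, 0.9):
        print(tau, expectile(data, tau))
```

Unlike quantiles, expectiles depend on the magnitude of deviations rather than only their sign, so a single extreme value shifts them noticeably. This sensitivity is the general property that makes expectile-based statistics a plausible tool for flagging influential observations; how the two statistics of the method are actually constructed is specified in the paper itself.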