This paper presents ROBOUT, a fast methodology for identifying outliers in a response variable conditional on a set of linearly related predictors retrieved from a large granular dataset. ROBOUT is shown to be effective, and particularly versatile compared to existing methods, in the presence of several idiosyncratic data features. In particular, ROBOUT can identify observations with outlying conditional variance even when the dataset contains element-wise sparse variables and the set of predictors contains multivariate outliers. Existing integrated methodologies such as SPARSE-LTS and RLARS are systematically sub-optimal under those conditions. ROBOUT consists of three stages: a robust selection of the statistically relevant predictors (using a Huber or a quantile loss), the estimation of a robust regression model on the selected predictors (by LTS, GS or MM), and a criterion that identifies conditional outliers based on a robust measure of the residuals' dispersion. We conduct a comprehensive simulation study in which the variants of the proposed algorithm are tested under an exhaustive set of perturbation scenarios. The methodology is also applied to a granular supervisory banking dataset collected by the European Central Bank.
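The three stages described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the L1-penalised Huber selection solved by proximal gradient, the penalty `lam`, the trimming fraction of 0.75, the MAD-based dispersion measure and the cutoff of 3 are all assumptions made for illustration, and LTS is only one of the three regression options the abstract mentions.

```python
import numpy as np

def robust_scale(a):
    # Centre by the median and scale by the normalised MAD.
    # Assumes a nonzero MAD; very sparse columns would need separate handling.
    med = np.median(a, axis=0)
    mad = 1.4826 * np.median(np.abs(a - med), axis=0)
    return (a - med) / mad

def huber_lasso_select(X, y, lam=0.1, delta=1.345, n_iter=500):
    # Stage 1 (assumed form): L1-penalised Huber regression via proximal gradient.
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    step = 1.0 / np.linalg.eigvalsh(A.T @ A / n).max()
    b0, b = 0.0, np.zeros(p)
    for _ in range(n_iter):
        psi = np.clip(y - b0 - X @ b, -delta, delta)  # Huber score of residuals
        b0 += step * psi.mean()                       # unpenalised intercept
        z = b + step * (X.T @ psi) / n
        b = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
    return np.flatnonzero(b)

def lts_fit(X, y, h, rng, n_starts=50, n_csteps=20):
    # Stage 2: least trimmed squares via random elemental starts and C-steps.
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    best_obj, best_coef = np.inf, None
    for _ in range(n_starts):
        idx = rng.choice(n, size=A.shape[1], replace=False)
        for _ in range(n_csteps):
            coef = np.linalg.lstsq(A[idx], y[idx], rcond=None)[0]
            r2 = (y - A @ coef) ** 2
            idx = np.argsort(r2)[:h]      # keep the h best-fitted observations
        obj = np.sort(r2)[:h].sum()
        if obj < best_obj:
            best_obj, best_coef = obj, coef
    return best_coef

def robout_sketch(X, y, lam=0.1, cutoff=3.0, trim=0.75, seed=0):
    # Hypothetical wrapper chaining the three stages.
    rng = np.random.default_rng(seed)
    selected = huber_lasso_select(robust_scale(X), robust_scale(y), lam=lam)
    Xs = X[:, selected]
    coef = lts_fit(Xs, y, h=int(trim * len(y)), rng=rng)
    resid = y - np.column_stack([np.ones(len(y)), Xs]) @ coef
    # Stage 3: flag residuals that are large relative to their robust dispersion.
    flagged = np.flatnonzero(np.abs(robust_scale(resid)) > cutoff)
    return selected, flagged

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 20))
    y = 3 * X[:, 0] - 2 * X[:, 1] + 1.5 * X[:, 2] + 0.5 * rng.normal(size=200)
    y[:10] += 20.0  # contaminate the first ten responses
    selected, flagged = robout_sketch(X, y)
    print("selected predictors:", selected)
    print("flagged observations:", flagged)
```

The sketch keeps selection and estimation separate, as the abstract describes, so either stage can be swapped (for example, a quantile loss in stage 1 or an MM-estimator in stage 2) without touching the outlier criterion.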