Outlier detection has attracted growing interest in recent years, owing to newly emerging technologies and the large amounts of high-dimensional data now available. It can help practitioners identify unwanted noise or locate interesting abnormal observations. We develop a novel outlier detection method for possibly high-dimensional datasets that contain both discrete and continuous variables. We use the family of decomposable graphical models to model the relationships between the variables, and from this we construct an exact likelihood ratio test of whether a given observation is an outlier. On a real data example, our method outperforms the state-of-the-art Isolation Forest algorithm.