Isolation forest, or "iForest", is an intuitive and widely used anomaly detection algorithm built on a simple yet effective idea: given some data distribution, if a threshold (split point) is drawn uniformly at random within the range of some variable, and the data points are divided according to whether they fall above or below that threshold, outliers are more likely to end up alone or in the smaller partition. The original procedure chooses both the variable to split on and the split point within that variable uniformly at random at each step. This paper shows that "clustered" diverse outliers, often a more interesting class of outliers than others, can be identified more easily when the variables and/or thresholds are chosen non-uniformly at random. Different split-guiding criteria are compared, and some are found to yield significantly better outlier discrimination for certain classes of outliers.
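The isolation idea can be illustrated with a minimal sketch of the original, uniformly random procedure only; it does not implement any of the guided split criteria compared in the paper. The function names (`isolation_depth`, `mean_isolation_depth`), the synthetic data, and the parameters `n_trees` and `max_depth` are assumptions made for illustration, and the subsampling used in practical implementations is omitted for brevity.

```python
import random


def isolation_depth(points, target, rng, max_depth=50):
    """Count uniformly random splits until `target` is alone (or max_depth is hit)."""
    current = list(points)
    depth = 0
    while len(current) > 1 and depth < max_depth:
        # Only variables whose values still differ can be split on.
        splittable = [
            d for d in range(len(target))
            if min(p[d] for p in current) < max(p[d] for p in current)
        ]
        if not splittable:
            break  # remaining points are identical; no further isolation possible
        dim = rng.choice(splittable)  # variable chosen uniformly at random
        lo = min(p[dim] for p in current)
        hi = max(p[dim] for p in current)
        threshold = rng.uniform(lo, hi)  # split point chosen uniformly at random
        # Follow only the partition that contains the target point.
        if target[dim] <= threshold:
            current = [p for p in current if p[dim] <= threshold]
        else:
            current = [p for p in current if p[dim] > threshold]
        depth += 1
    return depth


def mean_isolation_depth(points, target, n_trees=200, seed=0):
    """Average isolation depth over many independent random split sequences."""
    rng = random.Random(seed)
    total = sum(isolation_depth(points, target, rng) for _ in range(n_trees))
    return total / n_trees


# Hypothetical data: a 2-D Gaussian cluster plus one central point and one far point.
data_rng = random.Random(42)
cluster = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
inlier = (0.0, 0.0)
outlier = (8.0, 8.0)
data = cluster + [inlier, outlier]

inlier_depth = mean_isolation_depth(data, inlier)
outlier_depth = mean_isolation_depth(data, outlier)
print(f"mean depth, inlier:  {inlier_depth:.2f}")
print(f"mean depth, outlier: {outlier_depth:.2f}")
```

Under these assumptions the far point should need noticeably fewer splits on average than the central one, which is the signal the algorithm turns into an anomaly score. A guided variant would replace the two uniform draws (`rng.choice` and `rng.uniform`) with a criterion-driven choice.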