Outlier detection based on rate-distortion theory rests on the rationale that a good data compression scheme encodes outliers with unique symbols. Building on this rationale, we propose Cluster Purging, an extension of clustering-based outlier detection. The extension makes it possible to assess the representivity of clusterings and to find data points that are best represented by individual, unique clusters. We propose two efficient algorithms for performing Cluster Purging: one is parameter-free, while the other has a parameter that controls the representivity estimates, allowing it to be tuned in supervised setups. In an experimental evaluation, we show that Cluster Purging improves upon outliers detected from raw clusterings and competes strongly against state-of-the-art alternatives.
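To make the rationale concrete, the following is a minimal sketch of the general idea only, not either of the two Cluster Purging algorithms. It assumes a clustering is already given as a list of centroids and flags points whose distortion (squared distance to the assigned centroid) is far above what is typical for their cluster, i.e. points that a compressor would arguably represent better with a symbol of their own. The function names, the `factor` parameter, and the median-based rule are all hypothetical choices made for illustration.

```python
import math


def nearest_centroid(point, centroids):
    """Return (index, squared distance) of the centroid closest to `point`."""
    best_idx, best_d2 = -1, math.inf
    for idx, c in enumerate(centroids):
        d2 = sum((p - q) ** 2 for p, q in zip(point, c))
        if d2 < best_d2:
            best_idx, best_d2 = idx, d2
    return best_idx, best_d2


def purge_candidates(points, centroids, factor=3.0):
    """Flag points that their cluster represents poorly.

    Hypothetical rule (an assumption, not taken from the paper): a point is
    flagged when its squared distance to its centroid exceeds `factor` times
    the median squared distance within the same cluster.
    """
    assigned = [nearest_centroid(p, centroids) for p in points]
    flagged = []
    for k in range(len(centroids)):
        # Distortions of all points assigned to cluster k, in ascending order.
        d2s = sorted(d2 for idx, d2 in assigned if idx == k)
        if not d2s:
            continue
        median = d2s[len(d2s) // 2]
        for i, (idx, d2) in enumerate(assigned):
            if idx == k and d2 > factor * median:
                flagged.append(i)
    return sorted(flagged)


# Toy data: two tight groups plus one point far from both centroids.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (-0.1, 0.0),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (4.9, 5.0),
    (2.5, 9.0),
]
centroids = [(0.0, 0.0), (5.0, 5.0)]

print(purge_candidates(points, centroids))  # prints [8]
```

In this toy setup only the last point is flagged: it is assigned to the second centroid but lies much farther from it than the other members of that cluster. The `factor` parameter plays a role loosely analogous to a tunable representivity threshold, though the correspondence to the parameter described in the abstract is only an assumption.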