Outlier detection, which identifies unusual patterns that carry significant implications in many applications, has drawn considerable research interest. Existing semi-supervised methods typically treat data as purely numerical and process it in a deterministic manner, thereby neglecting the heterogeneity and uncertainty inherent in complex, real-world datasets. This paper introduces a label-informed outlier detection method for heterogeneous data based on Granular Computing and Fuzzy Sets, namely the Granule Density-based Outlier Factor (GDOF). Specifically, GDOF first employs label-informed fuzzy granulation to represent various data types and develops granule density for precise density estimation. Subsequently, the granule densities from individual attributes are integrated into an outlier score, with attribute relevance assessed using a limited number of labeled outliers. Experimental results on various real-world datasets show that GDOF stands out at detecting outliers in heterogeneous data with a minimal number of labeled outliers. The integration of Fuzzy Sets and Granular Computing in GDOF offers a practical framework for outlier detection in complex and diverse data types. All relevant datasets and source code are publicly available for further research. This is the author's accepted manuscript of a paper published in IEEE Transactions on Fuzzy Systems. The final version is available at https://doi.org/10.1109/TFUZZ.2024.3514853
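The pipeline described in the abstract (per-attribute fuzzy granulation, granule density, and label-guided integration across attributes) can be illustrated with a small sketch. This is not the paper's formulation: the similarity functions, the density definition (mean membership of a sample's fuzzy granule), the relevance weighting, and all function names below are simplifying assumptions chosen for illustration, and the granulation here does not use labels, unlike GDOF's label-informed granulation.

```python
import numpy as np


def fuzzy_similarity(column, is_numeric):
    """Pairwise fuzzy similarity for one attribute (an assumed, illustrative choice)."""
    if is_numeric:
        col = np.asarray(column, dtype=float)
        span = col.max() - col.min()
        if span == 0:
            return np.ones((len(col), len(col)))
        scaled = (col - col.min()) / span
        # Similarity decays linearly with distance on the min-max scaled axis.
        return 1.0 - np.abs(scaled[:, None] - scaled[None, :])
    # Categorical attribute: crisp equality (1.0 if same category, else 0.0).
    _, codes = np.unique(np.asarray(column), return_inverse=True)
    return (codes[:, None] == codes[None, :]).astype(float)


def granule_density(similarity):
    """Assumed density: mean membership of each sample's fuzzy granule."""
    return similarity.mean(axis=1)


def outlier_scores(columns, numeric_flags, labeled_outliers):
    """Combine per-attribute densities, weighting attributes by an assumed relevance."""
    densities = np.column_stack([
        granule_density(fuzzy_similarity(col, flag))
        for col, flag in zip(columns, numeric_flags)
    ])
    # Assumed relevance: how much sparser the labeled outliers are than average
    # on each attribute; attributes with no gap get zero weight.
    gap = densities.mean(axis=0) - densities[labeled_outliers].mean(axis=0)
    gap = np.clip(gap, 0.0, None)
    if gap.sum() > 0:
        weights = gap / gap.sum()
    else:
        weights = np.full(len(gap), 1.0 / len(gap))
    # Low density on relevant attributes yields a high outlier score.
    return (1.0 - densities) @ weights


# Toy heterogeneous data: one numerical and one categorical attribute.
numeric_attr = [1.0, 1.1, 0.9, 1.0, 5.0]
categorical_attr = ["a", "a", "a", "a", "b"]
scores = outlier_scores(
    [numeric_attr, categorical_attr],
    numeric_flags=[True, False],
    labeled_outliers=[4],  # index of the single labeled outlier
)
print(scores)
```

On this toy input the last sample is isolated on both attributes, so it receives the largest score. The weighting step is where a handful of labeled outliers enters: attributes on which those outliers look no sparser than the rest contribute nothing to the final score.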