Motivation: Many data mining approaches for biomedical data analysis, including state-of-the-art associative models, require some form of data discretization. Although diverse discretization approaches have been proposed, they generally operate under a strict set of statistical assumptions that are arguably insufficient to handle the diversity and heterogeneity of the clinical and molecular variables within a given dataset. In addition, although a growing number of symbolic approaches in bioinformatics can assign multiple items to values occurring near discretization boundaries for greater robustness, there are no reference principles on how to perform multi-item discretization. Results: This study proposes DI2, an unsupervised discretization method for variables with arbitrarily skewed distributions. DI2 provides robust guarantees of generalization by applying data corrections using the Kolmogorov-Smirnov test before statistically fitting candidate distributions. DI2 further supports multi-item assignments. Results gathered from biomedical data show its relevance in improving on classic discretization choices. Software: available at https://github.com/JupitersMight/DI2
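The general idea described under Results (fit candidate distributions, rank them with the Kolmogorov-Smirnov statistic, cut the best fit into bins, and give border values more than one item) can be illustrated with a minimal sketch. This is not the DI2 implementation or its API: the function names, the two candidate families (normal and exponential), the equal-probability cut points, and the `border` tolerance are all assumptions made for illustration, and the data correction step mentioned in the abstract is omitted because its details are not given here.

```python
import math
import random
from statistics import NormalDist, fmean, stdev


def ks_statistic(sorted_values, cdf):
    """One-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDF of the sample and a candidate CDF."""
    n = len(sorted_values)
    d = 0.0
    for i, x in enumerate(sorted_values):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def fit_candidates(values):
    """Fit each (hypothetical) candidate family by simple moment matching.
    Returns a dict: name -> (cdf, inverse cdf)."""
    mu, sigma = fmean(values), stdev(values)
    normal = NormalDist(mu, sigma)
    candidates = {"normal": (normal.cdf, normal.inv_cdf)}
    if min(values) >= 0 and mu > 0:  # exponential only fits non-negative data
        rate = 1.0 / mu
        candidates["exponential"] = (
            lambda x: 1.0 - math.exp(-rate * x) if x > 0 else 0.0,
            lambda p: -math.log(1.0 - p) / rate,
        )
    return candidates


def discretize(values, n_bins=3, border=0.1):
    """Pick the candidate with the smallest KS statistic, cut it into
    equal-probability bins, and assign a second item to values whose
    position lies within `border` (a fraction of a bin) of a boundary."""
    ordered = sorted(values)
    fitted = fit_candidates(values)
    best = min(fitted, key=lambda name: ks_statistic(ordered, fitted[name][0]))
    cdf, inv_cdf = fitted[best]
    cuts = [inv_cdf(k / n_bins) for k in range(1, n_bins)]

    items = []
    for x in values:
        position = cdf(x) * n_bins                 # location on the bin scale
        primary = min(int(position), n_bins - 1)   # main item for this value
        offset = position - primary                # where x sits inside its bin
        assigned = [primary]
        if offset < border and primary > 0:
            assigned.append(primary - 1)           # close to the lower boundary
        elif offset > 1 - border and primary < n_bins - 1:
            assigned.append(primary + 1)           # close to the upper boundary
        items.append(tuple(assigned))
    return best, cuts, items


# Usage on a synthetic, right-skewed sample (assumed data, for illustration only).
random.seed(0)
sample = [random.expovariate(0.5) for _ in range(500)]
best, cuts, items = discretize(sample, n_bins=3, border=0.1)
```

Each entry of `items` is a tuple holding one item, or two when the value falls near a cut point, which is the multi-item behaviour the abstract refers to. Measuring closeness on the probability scale rather than in raw units is a design choice of this sketch, made so that the tolerance behaves consistently across skewed variables.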