Unsupervised discretization is a crucial step in many knowledge discovery tasks. The state-of-the-art method for one-dimensional data infers locally adaptive histograms using the minimum description length (MDL) principle, but the multi-dimensional case is far less studied: current methods consider the dimensions one at a time (if not independently), which result in discretizations based on rectangular cells of adaptive size. Unfortunately, this approach is unable to adequately characterize dependencies among dimensions and/or results in discretizations consisting of more cells (or bins) than is desirable. To address this problem, we propose an expressive model class that allows for far more flexible partitions of two-dimensional data. We extend the state of the art for the one-dimensional case to obtain a model selection problem based on the normalized maximum likelihood, a form of refined MDL. As the flexibility of our model class comes at the cost of a vast search space, we introduce a heuristic algorithm, named PALM, which Partitions each dimension ALternately and then Merges neighboring regions, all using the MDL principle. Experiments on synthetic data show that PALM 1) accurately reveals ground truth partitions that are within the model class (i.e., the search space), given a large enough sample size; 2) approximates well a wide range of partitions outside the model class; 3) converges, in contrast to the state-of-the-art multivariate discretization method IPD. Finally, we apply our algorithm to three spatial datasets, and we demonstrate that, compared to kernel density estimation (KDE), our algorithm not only reveals more detailed density changes, but also fits unseen data better, as measured by the log-likelihood.
翻译:不受监督的离散是许多知识发现任务中的关键一步。 一维数据最先进的方法用最小描述长度( MDL) 原则推断出本地适应性直方图,但多维情况研究得少得多:目前的方法是一次考虑维度(如果不是独立的话),这导致基于适应大小的矩形单元格的离散。 不幸的是, 这种方法无法充分描述不同维度和/ 或离散结果之间的依赖性, 由更多单元格( 或 bins) 构成的离异性。 为了解决这个问题, 我们提议了一个直观模型类, 允许使用更灵活的二维数据分布。 我们扩展了一维情况, 以基于标准化最大可能性( 一种精细化的 MDL 形式) 获得模型选择问题。 由于我们模型等级的灵活性以巨大的搜索空间为代价, 我们引入了一种叫做 PALM 的模型, 将每个维度都按不同维度调整, 然后将邻近区域合并, 使用 MDL 的多维值对数据分布进行更灵活的分区。 我们的模型将显示一个更精确的 。