Mapping of spatial hotspots, i.e., regions where certain events (e.g., disease or crime cases) occur at significantly higher rates, is an important task in diverse societal domains, including public health, public safety, transportation, agriculture, and environmental science. Clustering techniques required by these domains differ from traditional clustering methods due to the high economic and social costs of spurious results (e.g., false alarms of crime clusters). As a result, statistical rigor is needed to explicitly control the rate of spurious detections. To address this challenge, techniques for statistically-robust clustering (e.g., scan statistics) have been extensively studied by the data mining and statistics communities. In this survey, we present an up-to-date and detailed review of the models and algorithms developed in this field. We first present a general taxonomy for statistically-robust clustering, covering the key steps of data and statistical modeling, region enumeration and maximization, and significance testing. We then discuss the different paradigms and methods within each of these key steps. Finally, we highlight research gaps and potential future directions, which may serve as a stepping stone for generating new ideas in this growing field and beyond.
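The three steps named in the abstract (statistical modeling, region enumeration and maximization, significance testing) can be illustrated with a small scan-statistic sketch. This is not taken from the survey. It assumes a Poisson model with a Kulldorff-style log-likelihood ratio, a small grid of cells with strictly positive baselines, axis-aligned rectangles as the candidate regions, and Monte Carlo simulation for the p-value. All function names are hypothetical.

```python
import math
import random


def poisson_llr(c, e, total):
    """Statistical model: Poisson log-likelihood ratio of one region.

    c = observed cases inside, e = expected cases inside under the null,
    total = all cases in the study area. Only elevated rates score > 0.
    """
    if c <= e:
        return 0.0
    inside = c * math.log(c / e)
    outside = 0.0 if c == total else (total - c) * math.log((total - c) / (total - e))
    return inside + outside


def prefix_sums(grid):
    """2D prefix sums so any rectangle total is an O(1) lookup."""
    rows, cols = len(grid), len(grid[0])
    p = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            p[i + 1][j + 1] = grid[i][j] + p[i][j + 1] + p[i + 1][j] - p[i][j]
    return p


def rect_sum(p, r0, c0, r1, c1):
    """Sum over rows r0..r1-1 and columns c0..c1-1."""
    return p[r1][c1] - p[r0][c1] - p[r1][c0] + p[r0][c0]


def max_scan(cases, baseline):
    """Region enumeration and maximization over all axis-aligned rectangles."""
    rows, cols = len(cases), len(cases[0])
    pc, pb = prefix_sums(cases), prefix_sums(baseline)
    total_c, total_b = pc[rows][cols], pb[rows][cols]
    best_score, best_region = 0.0, None
    for r0 in range(rows):
        for r1 in range(r0 + 1, rows + 1):
            for c0 in range(cols):
                for c1 in range(c0 + 1, cols + 1):
                    c = rect_sum(pc, r0, c0, r1, c1)
                    e = total_c * rect_sum(pb, r0, c0, r1, c1) / total_b
                    score = poisson_llr(c, e, total_c)
                    if score > best_score:
                        best_score, best_region = score, (r0, c0, r1, c1)
    return best_score, best_region


def monte_carlo_p_value(cases, baseline, n_sim=199, seed=0):
    """Significance testing: compare the observed maximum with null maxima."""
    rng = random.Random(seed)
    rows, cols = len(cases), len(cases[0])
    observed, region = max_scan(cases, baseline)
    cells = [(i, j) for i in range(rows) for j in range(cols)]
    weights = [baseline[i][j] for i, j in cells]
    total = sum(sum(row) for row in cases)
    exceed = 0
    for _ in range(n_sim):
        # Null hypothesis: cases fall in cells in proportion to the baseline.
        sim = [[0] * cols for _ in range(rows)]
        for i, j in rng.choices(cells, weights=weights, k=total):
            sim[i][j] += 1
        if max_scan(sim, baseline)[0] >= observed:
            exceed += 1
    return observed, region, (exceed + 1) / (n_sim + 1)
```

Calling `monte_carlo_p_value(cases, baseline)` on two equally sized grids of integer case counts and positive baselines returns the maximum score, the best rectangle as `(r0, c0, r1, c1)` with exclusive upper bounds, and the p-value. Because the p-value is computed for the maximum over all candidate regions, the rate of spurious detections is controlled across the whole search rather than per region.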