With growing calls for transparency, governments are releasing increasing amounts of data in multiple domains, including finance, education and healthcare. The efficient exploratory analysis of healthcare data remains a significant challenge. Key concerns in public health include the quick identification and analysis of trends and the detection of outliers, which allow policies to be rapidly adapted to changing circumstances. We present an efficient outlier detection technique, termed PIKS (Pruned iterative-k means searchlight), which combines an iterative k-means algorithm with a pruned searchlight-based scan. We apply this technique to identify outliers in two publicly available healthcare datasets, one from the New York Statewide Planning and Research Cooperative System and one from California's Office of Statewide Health Planning and Development. We compare our technique with three existing outlier detection techniques: auto-encoders, isolation forests and feature bagging. We identified outliers in conditions including suicide rates, immunity disorders, social admissions, cardiomyopathies, and pregnancy in the third trimester. We demonstrate that the PIKS technique produces results consistent with other techniques such as the auto-encoder. However, the auto-encoder needs to be trained, which requires several parameters to be tuned. In comparison, the PIKS technique has far fewer parameters to tune, which makes it advantageous for fast, "out-of-the-box" data exploration. The PIKS technique is scalable and can readily ingest new datasets. Hence, it can provide valuable, up-to-date insights to citizens, patients and policy-makers. We have made our code open source, and because the underlying data are open, other researchers can easily reproduce and extend our work. This will help promote a deeper understanding of healthcare policies and public health issues.
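The abstract does not spell out the PIKS procedure, so the following is only a loose illustration of the general idea of using k-means iteratively to surface outliers; it is not the authors' implementation and omits the pruned searchlight scan entirely. The sketch assumes one-dimensional data, k = 2, and a hypothetical rule that a cluster holding at most 10% of the remaining points is treated as outlying. The function names, the `max_fraction` threshold and the `discharge_costs` values are all invented for this example.

```python
from statistics import mean


def two_means_1d(values, max_iter=100):
    """Plain 1-D k-means with k=2, seeded deterministically at min and max."""
    lo, hi = min(values), max(values)
    for _ in range(max_iter):
        # Assign each value to the nearer of the two centroids.
        low = [v for v in values if abs(v - lo) <= abs(v - hi)]
        high = [v for v in values if abs(v - lo) > abs(v - hi)]
        if not high:  # every value is identical; nothing to split
            break
        new_lo, new_hi = mean(low), mean(high)
        if (new_lo, new_hi) == (lo, hi):  # centroids stopped moving
            break
        lo, hi = new_lo, new_hi
    return low, high


def iterative_kmeans_outliers(values, max_fraction=0.1, max_rounds=10):
    """Repeatedly split the data in two and peel off any very small cluster.

    Assumption for this sketch: a cluster containing at most `max_fraction`
    of the remaining points is flagged as a group of outliers.
    """
    remaining = list(values)
    outliers = []
    for _ in range(max_rounds):
        if len(remaining) < 3:
            break
        low, high = two_means_1d(remaining)
        small, large = sorted((low, high), key=len)
        if not small or len(small) > max_fraction * len(remaining):
            break  # the split is balanced, so no cluster looks anomalous
        outliers.extend(small)
        remaining = large
    return sorted(outliers), remaining


# Hypothetical values (e.g. a cost measure per facility), made up for illustration.
discharge_costs = [10, 11, 12, 11, 10, 12, 11, 13, 12, 10, 11,
                   12, 11, 10, 12, 11, 10, 11, 12, 11, 95, 60]

flagged, kept = iterative_kmeans_outliers(discharge_costs)
print("flagged:", flagged)  # flagged: [60, 95]
print("kept:", len(kept), "values")
```

On this toy input the first round isolates the two extreme values, and the second round stops because the remaining values split into two clusters of comparable size. The only parameter to choose here is the size threshold, which is in the spirit of the abstract's point about having few parameters to tune, although the actual parameters of PIKS may differ.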