This paper explores how machine learning can improve our understanding of water accessibility issues in Colonias, underserved communities located along the northern side of the United States-Mexico border. We analyzed more than 2,000 such communities using data from the Rural Community Assistance Partnership (RCAP) and applied hierarchical clustering and the adaptive affinity propagation algorithm to automatically group Colonias into clusters with different water access conditions. We used the Gower distance so that the algorithms could process complex datasets containing both categorical and numerical attributes. To better understand and explain the clustering results, we further applied a decision tree analysis that associates the input data with the derived clusters in order to identify and rank the factors that characterize the water access conditions in each cluster. Our results complement experts' priority rankings of water infrastructure needs and provide a more in-depth view of the water insecurity challenges that Colonias face. As an automated and reproducible workflow combining a series of tools, the proposed machine learning pipeline offers an operational approach to data-driven analysis of water access inequality. The pipeline can be adapted to other datasets and decision scenarios.
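To illustrate how a Gower-type distance lets a clustering algorithm compare records with both categorical and numerical attributes, the following is a minimal sketch rather than the pipeline's actual implementation. It assumes equal feature weights and no missing values. The attribute names and values (water source, population, wastewater service) are hypothetical and are not taken from the RCAP data. Categorical features contribute 0 when two records match and 1 otherwise. Numerical features contribute the absolute difference divided by the feature's range.

```python
from typing import List, Sequence, Union

Value = Union[float, str]


def gower_distance_matrix(
    records: Sequence[Sequence[Value]],
    is_categorical: Sequence[bool],
) -> List[List[float]]:
    """Pairwise Gower distances, assuming equal weights and no missing values."""
    n = len(records)
    n_features = len(is_categorical)

    # Range (max - min) of each numerical feature; None for categorical ones.
    ranges: List[Union[float, None]] = []
    for j in range(n_features):
        if is_categorical[j]:
            ranges.append(None)
        else:
            column = [float(r[j]) for r in records]
            ranges.append(max(column) - min(column))

    dist = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            total = 0.0
            for j in range(n_features):
                x, y = records[a][j], records[b][j]
                if is_categorical[j]:
                    # Simple matching: 0 if equal, 1 if different.
                    total += 0.0 if x == y else 1.0
                elif ranges[j]:
                    # Range-normalized absolute difference, in [0, 1].
                    # A constant feature (range 0) contributes nothing.
                    total += abs(float(x) - float(y)) / ranges[j]
            dist[a][b] = dist[b][a] = total / n_features
    return dist


# Hypothetical records: (water source, population, wastewater service).
records = [
    ("public", 100, "yes"),
    ("public", 300, "no"),
    ("hauled", 500, "no"),
]
is_categorical = [True, False, True]

dist = gower_distance_matrix(records, is_categorical)
for row in dist:
    print([round(d, 3) for d in row])
```

The resulting symmetric matrix can then be passed to any clustering routine that accepts precomputed pairwise distances, which is the role the Gower distance plays for the hierarchical clustering and affinity propagation steps described above.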