Linking Potentially Misclassified Healthy Food Access to Diabetes Prevalence

From arXiv: 26 pages (including 7 tables and 4 figures), with additional material available at https://github.com/ashleymullan/food_access_misclassification

Access to healthy food is key to maintaining a healthy lifestyle and can be quantified by the distance to the nearest grocery store. However, calculating this distance forces a trade-off between cost and correctness. Accurate route-based distances following passable roads are cost-prohibitive, while simple straight-line distances ignoring infrastructure and natural barriers are accessible yet error-prone. Categorizing low-access neighborhoods based on these straight-line distances induces misclassification and introduces bias into standard regression models estimating the relationship between disease prevalence and access. Yet, fully observing the more accurate, route-based food access measure is often impossible, which induces a missing data problem. We combat bias and address missingness with a new maximum likelihood estimator for Poisson regression with a binary, misclassified exposure (access to healthy food within some threshold), where the misclassification may depend on additional error-free covariates. In simulations, we show the consequence of ignoring the misclassification (bias) and how the proposed estimator corrects for bias while preserving more statistical efficiency than the complete case analysis (i.e., deleting observations with missing data). Finally, we apply our estimator to model the relationship between census tract diabetes prevalence and access to healthy food in northwestern North Carolina.
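The bias from ignoring misclassification can be illustrated with a small simulation. The sketch below is not the proposed estimator and is not taken from the paper's repository. It only shows how a naive Poisson fit on the error-prone exposure drifts away from the fit on the true exposure. All names and parameter values (`simulate`, `p_access`, `p_false_access`, the coefficients) are hypothetical. The sketch assumes no covariates and no population offset, so the Poisson slope has a closed form. It also assumes one-directional errors: a straight-line distance can never exceed the route distance, so a neighborhood with route-based access always appears to have straight-line access, but not the reverse.

```python
import math
import random


def rpoisson(rng, lam):
    """Draw one Poisson(lam) variate with Knuth's multiplication method (fine for small lam)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def log_rate_ratio(y, x):
    """Poisson regression slope for a single binary covariate (log link, no offset).

    With only an intercept and one binary covariate, the maximum likelihood
    estimate is the log of the ratio of the two group means.
    """
    y1 = [yi for yi, xi in zip(y, x) if xi == 1]
    y0 = [yi for yi, xi in zip(y, x) if xi == 0]
    return math.log(sum(y1) / len(y1)) - math.log(sum(y0) / len(y0))


def simulate(n=20000, beta0=1.0, beta1=-0.5,
             p_access=0.4, p_false_access=0.3, seed=1):
    """Return (slope using true exposure, slope using misclassified exposure).

    All parameter values are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    x_true, x_star, y = [], [], []
    for _ in range(n):
        # True (route-based) access indicator.
        x = 1 if rng.random() < p_access else 0
        # Error-prone (straight-line) indicator: true access always shows up
        # as access; true non-access is wrongly flagged as access with
        # probability p_false_access.
        xs = 1 if (x == 1 or rng.random() < p_false_access) else 0
        x_true.append(x)
        x_star.append(xs)
        # The outcome depends on the TRUE exposure only.
        y.append(rpoisson(rng, math.exp(beta0 + beta1 * x)))
    return log_rate_ratio(y, x_true), log_rate_ratio(y, x_star)


if __name__ == "__main__":
    est_true, est_naive = simulate()
    print(f"slope with true exposure:          {est_true:.3f}")
    print(f"slope with misclassified exposure: {est_naive:.3f}")
```

Under these assumed settings the naive slope is attenuated toward zero, because the "access" group defined by straight-line distance is contaminated with neighborhoods that truly lack access. A complete case analysis would avoid this by keeping only neighborhoods where the route-based indicator was observed, at the cost of discarding the rest.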
