Estimating the prevalence of a medical condition, or the proportion of the population in which it occurs, is a fundamental problem in healthcare and public health. Accurate estimates of the relative prevalence across groups -- capturing, for example, that a condition affects women more frequently than men -- facilitate effective and equitable health policy that prioritizes groups who are disproportionately affected by a condition. However, it is difficult to estimate relative prevalence when a medical condition is underreported. In this work, we provide a method for accurately estimating the relative prevalence of underreported medical conditions, building upon the positive-unlabeled learning framework. We show that under the commonly made covariate shift assumption -- i.e., that the probability of having a disease conditional on symptoms remains constant across groups -- we can recover the relative prevalence, even without the restrictive assumptions commonly made in positive-unlabeled learning and even if it is impossible to recover the absolute prevalence. We provide a suite of experiments on synthetic and real health data that demonstrate our method's ability to recover the relative prevalence more accurately than baselines, as well as its robustness to plausible violations of the covariate shift assumption.