Tuberculosis (TB), an infectious bacterial disease, is a significant cause of death, especially in low-income countries, with an estimated ten million new cases reported globally in 2020. While TB is treatable, non-adherence to the medication regimen is a major cause of morbidity and mortality. Proactively identifying patients at risk of dropping off their medication regimen therefore enables corrective measures that mitigate adverse outcomes. Using a proxy measure of extreme non-adherence and a dataset of nearly 700,000 patients from four states in India, we formulate and solve the machine learning (ML) problem of early prediction of non-adherence based on a custom rank-based metric. We train ML models with country-wide, large-scale future deployment in mind and evaluate them against baselines, achieving a $\sim 100\%$ lift over rule-based baselines and a $\sim 214\%$ lift over a random classifier. In the process we address several issues, including data quality, high-cardinality categorical data, low target prevalence, distribution shift, variation across cohorts, algorithmic fairness, and the need for robustness and explainability. Our findings indicate that risk stratification of non-adherent patients is a viable ML solution that can be deployed at scale. As the official AI partner of India's Central TB Division, we are working on multiple city- and state-level pilots with the goal of pan-India deployment.