Machine learning-based estimates of poverty and wealth are increasingly being used to guide the targeting of humanitarian aid and the allocation of social assistance. However, the ground truth labels used to train these models are typically borrowed from existing surveys that were designed to produce national statistics -- not to train machine learning models. Here, we test whether adaptive sampling strategies for ground truth data collection can improve the performance of poverty prediction models. Through simulations, we compare the status quo sampling strategies (uniform at random and stratified random sampling) to alternatives that prioritize acquiring training data based on model uncertainty or model performance on sub-populations. Perhaps surprisingly, we find that none of these active learning methods improve over uniform-at-random sampling. We discuss how these results can help shape future efforts to refine machine learning-based estimates of poverty.
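The comparison described above can be illustrated with a minimal simulation sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the synthetic Gaussian features, the ridge regression model, the bootstrap-ensemble standard deviation used as an uncertainty proxy, and the hypothetical helper names `make_data`, `fit_ridge`, `ensemble_std`, and `simulate`. It shows only the general shape of such an experiment: start from a small labeled set, acquire labels in batches either uniformly at random or by highest model uncertainty, and track held-out R² after each round.

```python
import numpy as np


def make_data(rng, n=700, d=8):
    # Synthetic stand-in for survey data: linear signal plus noise (assumption).
    X = rng.normal(size=(n, d))
    w_true = rng.normal(size=d)
    y = X @ w_true + rng.normal(scale=0.5, size=n)
    return X, y


def fit_ridge(X, y, lam=1.0):
    # Closed-form ridge regression weights.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)


def ensemble_std(X_lab, y_lab, X_pool, rng, n_models=10):
    # Bootstrap ensemble; the spread of predictions is the uncertainty proxy.
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, len(y_lab), size=len(y_lab))
        preds.append(X_pool @ fit_ridge(X_lab[idx], y_lab[idx]))
    return np.std(preds, axis=0)


def simulate(strategy, X, y, X_test, y_test, rng, n_init=20, batch=10, rounds=5):
    # Start from a small uniformly sampled labeled set, then acquire in batches.
    labeled = rng.choice(len(y), size=n_init, replace=False).tolist()
    scores = []
    for _ in range(rounds):
        pool = np.setdiff1d(np.arange(len(y)), labeled)
        if strategy == "uniform":
            new = rng.choice(pool, size=batch, replace=False)
        elif strategy == "uncertainty":
            std = ensemble_std(X[labeled], y[labeled], X[pool], rng)
            new = pool[np.argsort(std)[-batch:]]  # most uncertain points
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        labeled.extend(new.tolist())
        w = fit_ridge(X[labeled], y[labeled])
        resid = y_test - X_test @ w
        scores.append(1.0 - np.mean(resid ** 2) / np.var(y_test))  # held-out R^2
    return scores


X, y = make_data(np.random.default_rng(0))
X_pool, y_pool, X_test, y_test = X[:500], y[:500], X[500:], y[500:]
results = {
    s: simulate(s, X_pool, y_pool, X_test, y_test, np.random.default_rng(1))
    for s in ("uniform", "uncertainty")
}
for name, scores in results.items():
    print(name, [round(float(s), 3) for s in scores])
```

A single run on toy data like this says nothing about which strategy is better in practice; a fair comparison would average over many seeds and use real survey labels and features, and would also include the stratified baseline.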