Desert locust outbreaks threaten the food security of a large part of Africa and have affected the livelihoods of millions of people over the years. Machine learning (ML) has been shown to be an effective approach to locust distribution modelling, which could assist in early warning. ML requires a significant amount of labelled data for training. Most publicly available labelled data on locusts are presence-only data, in which only sightings of locusts at a location are recorded, whereas a classifier also needs examples of locations where locusts are absent. Prior work using ML has therefore resorted to pseudo-absence generation methods to circumvent this issue. The most commonly used approach is to randomly sample points in a region of interest while ensuring that these sampled pseudo-absence points are at least a specified distance away from true presence points. In this paper, we compare this random sampling approach to more advanced pseudo-absence generation methods, such as environmental profiling and optimal background extent limitation, specifically for predicting desert locust breeding grounds in Africa. Interestingly, we find that among the algorithms we tested, namely logistic regression (LR), gradient boosting, random forests and maximum entropy, all popular in prior work, LR performed significantly better than the more sophisticated ensemble methods in terms of both prediction accuracy and F1 score. Although background extent limitation combined with random sampling boosted performance for the ensemble methods, this was not the case for LR, for which a significant improvement was instead obtained with environmental profiling. In light of this, we conclude that a simpler ML approach such as logistic regression, combined with more advanced pseudo-absence generation, specifically environmental profiling, can be a sensible and effective approach to predicting locust breeding grounds across Africa.
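The random sampling baseline described above can be illustrated with a minimal sketch. It is not the implementation used in the paper: the function names, the bounding-box representation of the region of interest, the great-circle distance metric, and all numeric values are assumptions chosen for illustration only.

```python
import math
import random

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, used by the haversine formula


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def sample_pseudo_absences(presences, bbox, n, min_dist_km,
                           seed=0, max_tries=100_000):
    """Rejection-sample up to n pseudo-absence points inside bbox.

    presences   : list of (lat, lon) presence records
    bbox        : (lat_min, lat_max, lon_min, lon_max), a hypothetical
                  rectangular stand-in for the region of interest
    min_dist_km : minimum allowed distance to every presence point

    Candidates are drawn uniformly in latitude/longitude, which is a
    simplification (it is not uniform by surface area).
    """
    lat_min, lat_max, lon_min, lon_max = bbox
    rng = random.Random(seed)
    absences = []
    for _ in range(max_tries):
        if len(absences) == n:
            break
        lat = rng.uniform(lat_min, lat_max)
        lon = rng.uniform(lon_min, lon_max)
        # Keep the candidate only if it is far enough from all presences.
        if all(haversine_km(lat, lon, p_lat, p_lon) >= min_dist_km
               for p_lat, p_lon in presences):
            absences.append((lat, lon))
    return absences


# Illustrative values only; these are not data from the paper.
presences = [(15.0, 30.0), (16.5, 32.0)]
bbox = (10.0, 20.0, 25.0, 35.0)
absences = sample_pseudo_absences(presences, bbox, n=50, min_dist_km=50.0)
```

The sampled points would then be labelled as absences and combined with the presence records to train a classifier. The `max_tries` cap means fewer than `n` points may be returned when the exclusion zones cover most of the region.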