Citizen science datasets can be very large and promise to improve species distribution modelling, but detection is imperfect, risking bias when fitting models. In particular, observers may not detect species that are actually present. Occupancy models can estimate and correct for this observation process, and multi-species occupancy models exploit similarities in the observation process across species, which can improve estimates for rare species. However, the computational methods currently used to fit these models do not scale to large datasets. We develop approximate Bayesian inference methods and use graphics processing units (GPUs) to scale multi-species occupancy models to very large citizen science datasets. We fit multi-species occupancy models to one month of data from the eBird project, consisting of 186,811 checklist records comprising 430 bird species. We evaluate the predictions on a spatially separated test set of 59,338 records, comparing two inference methods, Markov chain Monte Carlo (MCMC) and variational inference (VI), against occupancy models fitted to each species separately using maximum likelihood. We fit models to the entire dataset using VI, and to up to 32,000 records using MCMC. VI fitted to the entire dataset performed best, outperforming single-species models on both AUC (90.4% compared to 88.7%) and log likelihood (-0.080 compared to -0.085). We also evaluate how well range maps predicted by the model agree with expert maps. We find that modelling the detection process greatly improves agreement, and that the resulting maps agree as closely with expert maps as those estimated using high-quality survey data. Our results demonstrate that multi-species occupancy models are a compelling approach to modelling large citizen science datasets and that, once the observation process is taken into account, they can model species distributions accurately.
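To make concrete how an occupancy model separates presence from detection, the following is a minimal sketch of the textbook single-species, single-site occupancy likelihood. It is not the multi-species model or the inference code described above; the function names, the constant per-visit detection probability `p`, and the assumption of no false positives are illustrative choices that the abstract does not specify.

```python
import numpy as np


def site_log_likelihood(psi, p, detections):
    """Log-likelihood of one site's detection history (illustrative sketch).

    psi        -- assumed probability that the species occupies the site
    p          -- assumed per-visit detection probability when it is present
    detections -- sequence of 0/1 outcomes, one per visit (hypothetical data)
    """
    y = np.asarray(detections, dtype=float)
    # Probability of this exact history if the species is present.
    lik_present = np.prod(p ** y * (1.0 - p) ** (1.0 - y))
    # If the species is absent, only an all-zero history can occur
    # (this sketch assumes there are no false-positive detections).
    lik_absent = 1.0 if y.sum() == 0 else 0.0
    return np.log(psi * lik_present + (1.0 - psi) * lik_absent)


def prob_present_given_no_detections(psi, p, n_visits):
    """Posterior probability of presence after n_visits with no detection."""
    missed = psi * (1.0 - p) ** n_visits  # present but missed on every visit
    return missed / (missed + (1.0 - psi))
```

Under these assumptions, a site with `psi = 0.5` and `p = 0.5` that yields two visits without a detection still has a 0.2 probability of being occupied. Treating non-detection as absence would ignore that residual probability, which is the bias the observation model is meant to correct.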