Passive acoustic monitoring enables large-scale biodiversity assessment, but reliable classification of bioacoustic sounds requires not only high accuracy but also well-calibrated uncertainty estimates to support decision-making. In bioacoustics, calibration is challenged by overlapping vocalisations, long-tailed species distributions, and distribution shifts between training and deployment data. The calibration of multi-label deep learning classifiers in bioacoustics has not yet been assessed. We systematically benchmark the calibration of four state-of-the-art multi-label bird sound classifiers on the BirdSet benchmark, evaluating global, per-dataset, and per-class calibration using threshold-free calibration metrics (ECE, MCS) alongside a discrimination metric (cmAP). Model calibration varies significantly across datasets and classes. While Perch v2 and ConvNeXt$_{BS}$ show better global calibration, results vary between datasets. Both of these models are consistently underconfident, whereas AudioProtoPNet and BirdMAE are mostly overconfident. Surprisingly, calibration appears to be better for less frequent classes. Using simple post hoc calibration methods, we demonstrate a straightforward way to improve calibration. A small labelled calibration set is sufficient to significantly improve calibration with Platt scaling, whereas global calibration parameters suffer from dataset variability. Our findings highlight the importance of evaluating and improving uncertainty calibration in bioacoustic classifiers.
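To make the two central notions concrete, the following is a minimal sketch of a binned expected calibration error (ECE) for a single class and of Platt scaling fitted on a small labelled calibration set. It is illustrative only: the abstract does not specify the binning scheme, how scores are aggregated across labels, or how Platt parameters are fitted, so the equal-width bins, the gradient-descent fit, the synthetic data, and all function names (`binary_ece`, `fit_platt`) are assumptions rather than the authors' implementation. MCS is not covered here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_ece(probs, labels, n_bins=10):
    """Binned ECE for one class, using equal-width bins (an assumed variant)."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # Assign each prediction to a bin; clip so that p == 1.0 lands in the last bin.
    bin_ids = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            # Gap between mean confidence and observed positive rate in this bin,
            # weighted by the fraction of samples that fall into the bin.
            gap = abs(probs[mask].mean() - labels[mask].mean())
            ece += mask.mean() * gap
    return ece

def fit_platt(logits, labels, lr=0.1, steps=2000):
    """Fit p = sigmoid(a * logit + b) by gradient descent on the log loss."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=float)
    a, b = 1.0, 0.0
    for _ in range(steps):
        err = sigmoid(a * logits + b) - labels  # d(log loss)/d(pre-activation)
        a -= lr * np.mean(err * logits)
        b -= lr * np.mean(err)
    return a, b

# Synthetic, hypothetical example: an overconfident classifier for one class.
rng = np.random.default_rng(0)
true_logits = rng.normal(size=5000)
labels = (rng.random(5000) < sigmoid(true_logits)).astype(float)
model_logits = 3.0 * true_logits  # logits inflated by a factor of 3

# Fit on a small labelled calibration split, evaluate on the remainder.
cal, test = slice(0, 500), slice(500, None)
a, b = fit_platt(model_logits[cal], labels[cal])

ece_before = binary_ece(sigmoid(model_logits[test]), labels[test])
ece_after = binary_ece(sigmoid(a * model_logits[test] + b), labels[test])
print(f"a={a:.3f}, b={b:.3f}, ECE before={ece_before:.3f}, after={ece_after:.3f}")
```

In a multi-label setting, one plausible design is to fit a separate pair `(a, b)` per class or per dataset instead of one global pair. The abstract's observation that global calibration parameters suffer from dataset variability points in that direction, though which granularity the authors actually use is not stated here.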