Camera traps are widely used to monitor wildlife, and they collect a large number of pictures. The number of images collected per species usually follows a long-tail distribution, i.e., a few classes have a large number of instances, while many species account for only a small percentage of the images. Although these rare species are in most cases the ones of interest to ecologists, they are often neglected when using deep-learning models, because these models require a large number of training images. In this work, we propose a simple and effective framework called Square-Root Sampling Branch (SSB), which combines two classification branches, one trained using square-root sampling and the other using instance sampling, to improve long-tail visual recognition. We compare it to state-of-the-art methods for this task: square-root sampling, class-balanced focal loss, and balanced group softmax. To reach a more general conclusion, we systematically evaluated these methods on four families of computer vision models (ResNet, MobileNetV3, EfficientNetV2, and Swin Transformer) and four camera-trap datasets with different characteristics. We first prepared a robust baseline with the most recent training tricks and then applied the methods for improving long-tail recognition. Our experiments show that square-root sampling was the method that most improved performance on minority classes, by around 15%; however, this came at the cost of reducing accuracy on the majority classes by at least 3%. Our proposed framework (SSB) was competitive with the other methods and achieved the best or second-best results on the tail classes in most cases; unlike square-root sampling, however, its loss in performance on the head classes was minimal, so it achieved the best trade-off among all the evaluated methods.
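To make the difference between the two sampling schemes named above concrete, the following is a minimal sketch and not the authors' implementation. It assumes the common definition in which a class is drawn with probability proportional to its image count raised to a power: 1 for instance sampling and 0.5 for square-root sampling. The function name `sampling_probabilities` and the species counts are hypothetical and chosen only for illustration.

```python
from collections import Counter


def sampling_probabilities(labels, power):
    """Return the probability of drawing each class.

    Assumption: probability is proportional to (class count) ** power,
    so power=1.0 is instance sampling and power=0.5 is square-root sampling.
    """
    counts = Counter(labels)
    weights = {cls: n ** power for cls, n in counts.items()}
    total = sum(weights.values())
    return {cls: w / total for cls, w in weights.items()}


# Hypothetical long-tailed dataset: one head class and two tail classes.
labels = ["deer"] * 900 + ["fox"] * 90 + ["ocelot"] * 10

instance = sampling_probabilities(labels, power=1.0)
square_root = sampling_probabilities(labels, power=0.5)

for cls in instance:
    print(f"{cls}: instance={instance[cls]:.3f}, square-root={square_root[cls]:.3f}")
```

Under this assumption, square-root sampling raises the share of the rare classes and lowers the share of the head class relative to instance sampling, which is consistent with the trade-off between tail and head accuracy reported above.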