This paper introduces SeaTurtleID, the first public large-scale, long-span dataset of sea turtle photographs captured in the wild. The dataset is suitable for benchmarking re-identification methods and for evaluating several other computer vision tasks. It consists of 7774 high-resolution photographs of 400 unique individuals, collected over 12 years in 1081 encounters. Each photograph is accompanied by rich metadata, e.g., an identity label, a head segmentation mask, and an encounter timestamp. Its 12-year span makes it the longest-spanning public wild animal dataset with timestamps. Exploiting this property, we show that timestamps are necessary for an unbiased evaluation of animal re-identification methods because they allow time-aware splits of the dataset into reference and query sets. We show that time-unaware splits can overestimate performance by more than 100% relative to time-aware splits, for both feature-based and CNN-based re-identification methods. We also argue that time-aware splits correspond to more realistic re-identification pipelines than time-unaware ones. We recommend that animal re-identification methods be tested only on datasets with timestamps, using time-aware splits, and we encourage dataset curators to include such information in the associated metadata.
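To make the distinction between the two kinds of split concrete, the following is a minimal sketch in Python. It is not the authors' protocol: the record layout (`photo`, `identity`, `date`), the toy data, the function names, and the single-cutoff rule are all assumptions chosen for illustration. A time-unaware split assigns photographs to the reference and query sets at random, so photographs from the same encounter can land on both sides. A time-aware split places all photographs taken before a cutoff date in the reference set and all later ones in the query set.

```python
import random
from datetime import date

# Hypothetical toy records; the field names are illustrative, not the dataset's schema.
records = [
    {"photo": "a1.jpg", "identity": "t001", "date": date(2012, 6, 1)},
    {"photo": "a2.jpg", "identity": "t001", "date": date(2012, 6, 1)},
    {"photo": "a3.jpg", "identity": "t001", "date": date(2018, 7, 9)},
    {"photo": "b1.jpg", "identity": "t002", "date": date(2013, 8, 3)},
    {"photo": "b2.jpg", "identity": "t002", "date": date(2013, 8, 3)},
    {"photo": "b3.jpg", "identity": "t002", "date": date(2019, 5, 20)},
]


def time_unaware_split(records, query_fraction=0.5, seed=0):
    """Assign photos to reference/query at random, ignoring timestamps."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_query = int(len(shuffled) * query_fraction)
    return shuffled[n_query:], shuffled[:n_query]  # (reference, query)


def time_aware_split(records, cutoff):
    """Photos taken before `cutoff` are reference; the rest are query."""
    reference = [r for r in records if r["date"] < cutoff]
    query = [r for r in records if r["date"] >= cutoff]
    return reference, query


def shared_encounters(reference, query):
    """Return (identity, date) pairs that occur in both sets.

    An encounter is approximated here as one individual on one day
    (an assumption made for this sketch).
    """
    ref_keys = {(r["identity"], r["date"]) for r in reference}
    query_keys = {(q["identity"], q["date"]) for q in query}
    return ref_keys & query_keys


reference, query = time_aware_split(records, cutoff=date(2015, 1, 1))
print("time-aware leakage:", shared_encounters(reference, query))

reference_u, query_u = time_unaware_split(records)
print("time-unaware leakage:", shared_encounters(reference_u, query_u))
```

With a cutoff-based split, no encounter can appear in both sets, because every encounter has a single date. With a random split, near-duplicate photographs from one encounter may be shared between reference and query, which is one plausible source of the performance overestimation described above.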