This work draws attention to the large fraction of near-duplicates in the training and test sets of datasets widely adopted in License Plate Recognition (LPR) research. These duplicates are images that, although different, show the same license plate. Our experiments, conducted on the two most popular datasets in the field, show a substantial decrease in recognition rate when six well-known models are trained and tested under fair splits, that is, splits in which no license plate appears in both the training and test sets. Moreover, in one of the datasets, the ranking of the models changed considerably when they were trained and tested under these duplicate-free splits. These findings suggest that such duplicates have significantly biased the evaluation and development of deep learning-based models for LPR. The list of near-duplicates we found and our proposed fair splits are publicly available for further research at https://raysonlaroca.github.io/supp/lpr-train-on-test/
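To illustrate what a fair split entails, the sketch below groups images by the plate they show before splitting, so that all images of a given plate land on the same side. It uses scikit-learn's GroupShuffleSplit; the file names and annotation format are hypothetical and not taken from the released splits.

# Minimal sketch of building a duplicate-free ("fair") split, assuming each
# image is annotated with the license plate text it contains. The paths and
# annotations below are hypothetical placeholders.
from sklearn.model_selection import GroupShuffleSplit

samples = [
    ("images/0001.jpg", "ABC1234"),
    ("images/0002.jpg", "ABC1234"),  # near-duplicate: same plate, different frame
    ("images/0003.jpg", "XYZ9876"),
]

paths = [p for p, _ in samples]
plates = [t for _, t in samples]  # group key: the plate string

# GroupShuffleSplit keeps every image of a given plate on the same side of the
# split, so no plate seen in training ever appears in the test set.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(paths, groups=plates))

train_set = [paths[i] for i in train_idx]
test_set = [paths[i] for i in test_idx]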