In this paper, we present a framework for reading analog clocks in natural images or videos. Specifically, we make the following contributions: First, we create a scalable pipeline for generating synthetic clocks, significantly reducing the need for labour-intensive annotation; Second, we introduce a clock recognition architecture based on spatial transformer networks (STN), trained end-to-end for clock alignment and recognition. We show that a model trained on the proposed synthetic dataset generalises to real clocks with good accuracy, advocating a Sim2Real training regime; Third, to further reduce the gap between simulation and real data, we leverage a special property of time, namely its uniformity, to generate reliable pseudo-labels on real, unlabelled clock videos, and show that training on these videos yields further improvements while still requiring zero manual annotations. Lastly, we introduce three benchmark datasets based on COCO, Open Images, and The Clock movie, totalling 4,472 images with clocks, fully annotated with time accurate to the minute.
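To make the second contribution concrete, the following is a minimal PyTorch sketch of an STN-based clock reader, under our own assumptions: the paper does not specify layer sizes or output parameterisation, so the architecture below (a localisation network that regresses an affine warp, a `grid_sample` alignment step, and a classification head over 12 hour bins and 60 minute bins) and all names such as `ClockSTN` are hypothetical. The key point it illustrates is that alignment and recognition sit in one differentiable graph, so both are trained end-to-end.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClockSTN(nn.Module):
    """Hypothetical STN-style clock reader: align, then recognise."""

    def __init__(self):
        super().__init__()
        # Localisation network: regresses 6 affine parameters from the image.
        self.loc_features = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(8),
        )
        self.loc_head = nn.Linear(32 * 8 * 8, 6)
        # Initialise to the identity transform so training starts stable.
        nn.init.zeros_(self.loc_head.weight)
        self.loc_head.bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))
        # Recognition head: reads the time from the aligned clock crop,
        # here as classification over 12 hour bins and 60 minute bins.
        self.rec = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(64 * 4 * 4, 12 + 60),
        )

    def forward(self, x):
        theta = self.loc_head(self.loc_features(x).flatten(1)).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        aligned = F.grid_sample(x, grid, align_corners=False)
        logits = self.rec(aligned)
        return logits[:, :12], logits[:, 12:]  # hour logits, minute logits

model = ClockSTN()
hour_logits, minute_logits = model(torch.randn(2, 3, 224, 224))
```

Because the warp is produced by `affine_grid`/`grid_sample`, gradients from the time-classification loss flow back through the alignment parameters, which is what "trained end-to-end for clock alignment and recognition" amounts to in this sketch.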
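The uniformity-based pseudo-labelling can likewise be sketched. The idea, as stated above, is that time flows at exactly one minute per minute of video, so model predictions on frames with known timestamps must advance linearly; predictions consistent with that line can be trusted as pseudo-labels. The filtering rule below is our illustrative reading, not the paper's exact procedure, and the function name and tolerance are hypothetical.

```python
import numpy as np

def uniform_pseudo_labels(pred_minutes, timestamps_s, tol_min=1.0):
    """Keep predictions consistent with uniform time flow.

    pred_minutes: predicted time-of-day in minutes (mod 720) per frame.
    timestamps_s: video timestamps in seconds for the same frames.
    Returns a boolean mask of frames usable as pseudo-labels.
    """
    pred = np.asarray(pred_minutes, dtype=float)
    t_min = np.asarray(timestamps_s, dtype=float) / 60.0
    # Under uniform flow, reading = offset + elapsed time; estimate the
    # offset robustly with the median, modulo 12 hours (720 minutes).
    residual = (pred - t_min) % 720.0
    offset = np.median(residual)
    # Circular deviation of each frame from the fitted uniform motion.
    dev = np.abs((residual - offset + 360.0) % 720.0 - 360.0)
    return dev <= tol_min

# Readings that advance ~1 minute per minute of video pass the check.
mask = uniform_pseudo_labels([154.0, 155.1, 157.0], [0.0, 60.0, 120.0])
print(mask)  # [ True  True  True ]
```

Frames that pass such a check supply (time, image) pairs for further training on real videos without any manual annotation, which is how the zero-annotation claim is preserved.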