Contrastive Multi-View Textual-Visual Encoding: Towards One Hundred Thousand-Scale One-Shot Logo Identification

In this paper, we study the problem of identifying logos of business brands in natural scenes in an open-set one-shot setting. This setup is significantly more challenging than the traditionally studied logo recognition settings, which are closed-set and assume a large number of training samples per category. We propose a novel multi-view textual-visual encoding framework that encodes both the text appearing in a logo and its graphical design to learn robust contrastive representations. These representations are learned jointly for multiple views of logos over a batch, and thereby generalize well to unseen logos. We evaluate the proposed framework on three tasks (cropped logo verification, cropped logo identification, and end-to-end logo identification in natural scenes) and compare it against state-of-the-art methods. Further, the literature lacks a very-large-scale collection of reference logo images that can facilitate the study of one hundred thousand-scale logo identification. To fill this gap, we introduce the Wikidata Reference Logo Dataset (WiRLD), containing logos for 100K business brands harvested from Wikidata. Our proposed framework achieves an area under the ROC curve of 91.3% on the QMUL-OpenLogo dataset for the verification task, and outperforms state-of-the-art methods by 9.1% and 2.6% on the one-shot logo identification task on the Toplogos-10 and FlickrLogos32 datasets, respectively. Further, we show that our method is more stable than the other baselines even when the number of candidate logos is on a 100K scale.
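
To make the open-set one-shot identification setting concrete, the following is a minimal sketch, not the paper's implementation. It assumes each brand has exactly one reference embedding and that a query is matched to the most similar reference by cosine similarity, with a rejection threshold for logos outside the gallery. The function names (`cosine`, `identify`), the toy three-dimensional vectors, and the threshold of 0.5 are all hypothetical; in the framework described above, the embeddings would come from the learned textual-visual encoder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def identify(query, references, threshold=0.5):
    """Match a query embedding against one reference embedding per brand.

    Returns (brand, score) for the most similar reference, or (None, score)
    when the best score falls below the threshold (open-set rejection).
    The threshold value is an illustrative assumption.
    """
    best_brand, best_score = None, -1.0
    for brand, ref in references.items():
        score = cosine(query, ref)
        if score > best_score:
            best_brand, best_score = brand, score
    if best_score < threshold:
        return None, best_score
    return best_brand, best_score

# Toy gallery: one hypothetical reference embedding per brand (one-shot).
references = {
    "brand_a": [1.0, 0.0, 0.0],
    "brand_b": [0.0, 1.0, 0.0],
}

# A query close to brand_a's reference is assigned to brand_a.
match, score = identify([0.9, 0.1, 0.0], references)

# A query orthogonal to every reference is rejected as unknown.
unknown, unknown_score = identify([0.0, 0.0, 1.0], references)
```

Because adding a new brand only requires adding one reference embedding to the gallery, this style of matching needs no retraining when the candidate set grows, which is what makes the 100K-scale setting feasible in principle.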
