Training a neural network to recognize the multiple labels associated with an image, including labels unseen during training, is challenging, especially for images that portray numerous semantically diverse labels. The task is nonetheless essential, since it reflects many real-world cases, such as retrieval of natural images. We argue that representing an image with a single embedding vector, as is commonly done, is not sufficient to rank both relevant seen and unseen labels accurately. This study introduces an end-to-end training scheme for multi-label zero-shot learning that supports the semantic diversity of images and labels. We propose using an embedding matrix composed of principal embedding vectors, trained with a tailored loss function. In addition, during training, we suggest up-weighting, in the loss function, image samples that present higher semantic diversity, to encourage diversity in the embedding matrix. Extensive experiments show that the proposed method improves the zero-shot model's quality in tag-based image retrieval, achieving state-of-the-art results on several common datasets (NUS-WIDE, COCO, Open Images).
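To illustrate why several principal embedding vectors may rank diverse labels better than a single vector, the following is a minimal sketch and not the method's actual implementation. It assumes that labels are represented by word vectors and that a label's score is the maximum dot product over the rows of the embedding matrix; neither detail is stated above. The function name `rank_labels` and the toy 2-D vectors are hypothetical.

```python
import numpy as np


def rank_labels(embedding_matrix, label_vectors):
    """Score labels against an image embedding matrix and rank them.

    Assumption (not from the source): a label's score is the maximum
    dot product between its vector and any row of the embedding matrix.
    """
    # (k, d) @ (d, n_labels) -> (k, n_labels): each principal vector vs. each label
    similarities = embedding_matrix @ label_vectors.T
    # Keep, for every label, its best-matching principal vector
    scores = similarities.max(axis=0)
    # Highest score first; stable sort keeps ties in label order
    ranking = np.argsort(-scores, kind="stable")
    return scores, ranking


# Hypothetical toy label vectors for three semantically distant labels
label_vectors = np.array([
    [1.0, 0.0],   # label 0, relevant to the image
    [0.0, 1.0],   # label 1, relevant to the image
    [-1.0, 0.0],  # label 2, not relevant
])

# Embedding matrix with two principal vectors, one per relevant concept
embedding_matrix = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
])

# Single-vector baseline: the average of the two principal vectors
single_vector = embedding_matrix.mean(axis=0, keepdims=True)

matrix_scores, matrix_ranking = rank_labels(embedding_matrix, label_vectors)
single_scores, single_ranking = rank_labels(single_vector, label_vectors)

print("matrix scores:", matrix_scores)
print("single-vector scores:", single_scores)
```

In this toy setup both relevant labels reach the full score under the matrix representation, while the averaged single vector gives each of them only a partial score, which is the intuition behind the argument above.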