This paper introduces a two-phase deep feature calibration framework for efficient learning of a semantics-enhanced text-image cross-modal joint embedding, which cleanly separates deep feature calibration in data preprocessing from training the joint embedding model. We use the Recipe1M dataset for the technical description and empirical validation. In preprocessing, we perform deep feature calibration by combining deep feature engineering with semantic context features derived from the raw text-image input data. We leverage an LSTM to identify key terms and NLP methods to produce ranking scores for these key terms before generating the key-term feature. We leverage WideResNet-50 to extract and encode image category semantics, which helps semantically align the learned recipe and image embeddings in the joint latent space. In joint embedding learning, we perform deep feature calibration by optimizing the batch-hard triplet loss with a soft margin and double negative sampling, together with a category-based alignment loss and a discriminator-based alignment loss. Extensive experiments demonstrate that our SEJE approach with deep feature calibration significantly outperforms state-of-the-art approaches.