T-EMDE: Sketching-based global similarity for cross-modal retrieval

The key challenge in cross-modal retrieval is to find similarities between objects represented in different modalities, such as image and text. However, the embeddings of each modality stem from unrelated feature spaces, which causes the notorious 'heterogeneity gap'. Many current cross-modal systems try to bridge the gap with self-attention, but self-attention has been widely criticized for its quadratic complexity, which rules out many real-life applications. In response, we propose T-EMDE, a neural density estimator inspired by the recently introduced Efficient Manifold Density Estimator (EMDE) from the area of recommender systems. EMDE operates on sketches, representations especially suitable for multimodal operations. However, EMDE is non-differentiable and ingests precomputed, static embeddings. With T-EMDE we introduce a trainable version of EMDE that allows full end-to-end training. In contrast to self-attention, the complexity of our solution is linear in the number of tokens/segments. As such, T-EMDE is a drop-in replacement for the self-attention module, with a beneficial influence on both speed and metric performance in cross-modal settings. It facilitates communication between modalities, as each global text/image representation is expressed with a standardized sketch histogram that represents the same manifold structures irrespective of the underlying modality. We evaluate T-EMDE by introducing it into two recent cross-modal SOTA models, achieving new state-of-the-art results on multiple datasets and decreasing model latency by up to 20%.
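To make the idea of a fixed-width sketch histogram computed in linear time more concrete, the following is a minimal illustration, not the T-EMDE layer itself. It assumes random-hyperplane bucketing as a stand-in for the space partitioning, assumes both modalities have already been projected to the same embedding dimension, and is not trainable. The function name `sketch_histogram` and all sizes are hypothetical.

```python
import numpy as np

def sketch_histogram(tokens, planes):
    """Aggregate a (n_tokens, dim) array into a fixed-width histogram sketch.

    `planes` has shape (depth, bits, dim): `depth` independent partitionings,
    each cutting the space with `bits` random hyperplanes into 2**bits buckets.
    The cost grows linearly with n_tokens; no token-to-token pairs are formed.
    """
    depth, bits, _ = planes.shape
    n_buckets = 2 ** bits
    # Sign of every token against every hyperplane: shape (depth, n_tokens, bits)
    signs = (np.einsum("kbd,nd->knb", planes, tokens) > 0).astype(np.int64)
    # Read each sign pattern as a binary number to get a bucket id: (depth, n_tokens)
    bucket_ids = signs @ (1 << np.arange(bits))
    sketch = np.zeros((depth, n_buckets))
    for k in range(depth):
        sketch[k] = np.bincount(bucket_ids[k], minlength=n_buckets)
    # Normalize by token count so inputs of different length are comparable
    sketch /= len(tokens)
    return sketch.reshape(-1)

# Hypothetical sizes: depth=4, bits=3, dim=16 (assumed for illustration only)
rng = np.random.default_rng(0)
planes = rng.standard_normal((4, 3, 16))
text_tokens = rng.standard_normal((12, 16))     # e.g. 12 word embeddings
image_regions = rng.standard_normal((36, 16))   # e.g. 36 region embeddings

text_sketch = sketch_histogram(text_tokens, planes)
image_sketch = sketch_histogram(image_regions, planes)
# Both sketches have the same width regardless of input length or modality,
# so a global similarity can be a plain dot product.
similarity = float(text_sketch @ image_sketch)
```

Because both inputs are bucketed by the same partitioning, the two histograms are directly comparable even though one comes from 12 tokens and the other from 36 regions. A trainable variant, as the abstract describes, would replace the hard bucket assignment with something differentiable; that part is not shown here.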
