Efficient large-scale retrieval requires representations that are both compact and discriminative. Foundation models provide powerful visual and multimodal embeddings, but nearest-neighbor search in these high-dimensional spaces is computationally expensive. Hashing offers an efficient alternative by enabling fast Hamming distance search with binary codes, yet existing approaches often rely on complex pipelines, multi-term objectives, designs specialized for a single learning paradigm, and long training times. We introduce CroVCA (Cross-View Code Alignment), a simple and unified principle for learning binary codes that remain consistent across semantically aligned views. A single binary cross-entropy loss enforces alignment, while coding-rate maximization serves as an anti-collapse regularizer to promote balanced and diverse codes. To implement this, we design HashCoder, a lightweight MLP hashing network with a final batch normalization layer to enforce balanced codes. HashCoder can be used as a probing head on frozen embeddings or to adapt encoders efficiently via LoRA fine-tuning. Across benchmarks, CroVCA achieves state-of-the-art results in just 5 training epochs. It performs particularly well at 16 bits: for instance, unsupervised hashing on COCO completes in under 2 minutes and supervised hashing on ImageNet100 in about 3 minutes on a single GPU. These results highlight CroVCA's efficiency, adaptability, and broad applicability.
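To make the objective described above concrete, here is a minimal NumPy sketch of one way a cross-view alignment loss with a coding-rate regularizer could be written. It is an illustration under stated assumptions, not the authors' implementation: the function names (`bce_alignment`, `coding_rate`, `crovca_style_loss`), the weight `lam`, the distortion parameter `eps`, the use of one view's hard bits as targets for the other view, and the `tanh` relaxation of the codes are all hypothetical choices. The coding rate uses the standard form $R(Z) = \tfrac{1}{2}\log\det\!\left(I + \tfrac{d}{n\varepsilon^2} Z^\top Z\right)$ for an $n \times d$ code matrix $Z$. A small Hamming distance helper shows how the resulting binary codes would be compared at retrieval time.

```python
import numpy as np


def sigmoid(x):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def bce_alignment(logits_a, logits_b):
    """Binary cross-entropy pulling view A's bit probabilities toward
    view B's hard bits (assumption: B's bits act as fixed targets)."""
    targets = (logits_b > 0).astype(np.float64)
    p = np.clip(sigmoid(logits_a), 1e-7, 1.0 - 1e-7)
    return float(-np.mean(targets * np.log(p) + (1.0 - targets) * np.log(1.0 - p)))


def coding_rate(z, eps=0.5):
    """Coding rate 0.5 * logdet(I + d / (n * eps^2) * Z^T Z) of an (n, d)
    code matrix. Larger values indicate more diverse, less collapsed codes."""
    n, d = z.shape
    gram = z.T @ z
    _, logdet = np.linalg.slogdet(np.eye(d) + (d / (n * eps ** 2)) * gram)
    return 0.5 * float(logdet)


def crovca_style_loss(logits_a, logits_b, lam=1.0):
    """Symmetric alignment term minus a weighted coding-rate term.
    Both the symmetrization and the weight `lam` are assumptions."""
    align = 0.5 * (bce_alignment(logits_a, logits_b)
                   + bce_alignment(logits_b, logits_a))
    # tanh gives a smooth stand-in for the +/-1 codes (assumption).
    rate = 0.5 * (coding_rate(np.tanh(logits_a))
                  + coding_rate(np.tanh(logits_b)))
    return align - lam * rate


def hamming_distances(query_bits, db_bits):
    """Hamming distance from one (d,) bit vector to each row of (m, d) bits."""
    return np.count_nonzero(query_bits[None, :] != db_bits, axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    view_a = rng.normal(size=(256, 16))                  # logits for view 1
    view_b = view_a + 0.1 * rng.normal(size=(256, 16))   # slightly perturbed view 2
    print("loss:", crovca_style_loss(view_a, view_b))
```

In this sketch, minimizing the loss lowers the alignment term when the two views agree on each bit, while the subtracted coding rate penalizes batches whose codes collapse onto a few patterns. In a trainable setting the same structure would be expressed in an autodiff framework, with the target bits excluded from the gradient.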