We establish a precise mathematical equivalence between witness-based similarity systems (REWA) and Shannon's information theory. We prove that witness overlap is mutual information, that REWA bit complexity bounds arise from channel capacity limitations, and that ranking-preserving encodings obey rate-distortion constraints. This unification reveals that fifty years of similarity search research, from Bloom filters to locality-sensitive hashing to neural retrieval, implicitly developed information theory for relational data. We derive fundamental lower bounds showing that REWA's $O(\Delta^{-2} \log N)$ complexity is optimal: no encoding scheme can preserve similarity rankings with fewer bits. The framework establishes that semantic similarity has physical units (bits of mutual information), that search is communication (query transmission over a noisy channel), and that retrieval systems face fundamental capacity limits analogous to Shannon's channel coding theorem.
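To make the first claim concrete, the following is a minimal illustrative sketch, not the paper's REWA construction: it assumes, purely for illustration, that each item is summarized by a binary vector of witness indicators, and it compares raw witness overlap with a plug-in estimate of the mutual information between two such indicator streams. All names (`witness_overlap`, `empirical_mutual_information`) and parameters here are hypothetical.

```python
# Illustrative sketch only: the REWA formalism from the paper is not reproduced here.
# Hypothetical model: each item is a length-m binary vector of witness indicators.

import math
import random


def witness_overlap(a, b):
    """Fraction of witness positions where both items fire (raw overlap)."""
    return sum(x and y for x, y in zip(a, b)) / len(a)


def empirical_mutual_information(a, b):
    """Plug-in estimate of I(A; B) in bits, treating positions as i.i.d. samples."""
    n = len(a)
    joint = {}
    for x, y in zip(a, b):
        joint[(x, y)] = joint.get((x, y), 0) + 1
    px = {v: sum(c for (x, _), c in joint.items() if x == v) / n for v in (0, 1)}
    py = {v: sum(c for (_, y), c in joint.items() if y == v) / n for v in (0, 1)}
    mi = 0.0
    for (x, y), c in joint.items():
        pxy = c / n
        if pxy > 0 and px[x] > 0 and py[y] > 0:
            mi += pxy * math.log2(pxy / (px[x] * py[y]))
    return mi


if __name__ == "__main__":
    random.seed(0)
    m = 10_000  # number of witnesses (hypothetical parameter)
    base = [random.random() < 0.3 for _ in range(m)]
    for noise in (0.05, 0.2, 0.5):  # more noise models a less related item
        other = [b if random.random() > noise else (random.random() < 0.3) for b in base]
        print(f"noise={noise:.2f}  overlap={witness_overlap(base, other):.3f}  "
              f"MI={empirical_mutual_information(base, other):.3f} bits")
```

Under this toy model, items that share more witnesses also exhibit higher estimated mutual information between their indicator streams, which is the qualitative relationship the claimed equivalence formalizes.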