We present a universal framework for similarity-preserving encodings that subsumes discrete, continuous, algebraic, and learned similarity methods under a single theoretical umbrella. By formulating similarity as functional witness projection over monoids, we prove an \(O\!\left(\frac{1}{\Delta^{2}}\log N\right)\) encoding-complexity bound with ranking preservation for arbitrary algebraic structures. This unification reveals that Bloom filters, Locality-Sensitive Hashing (LSH), Count-Min sketches, Random Fourier Features, and Transformer attention kernels are instances of the same underlying mechanism. We provide complete proofs with explicit constants under 4-wise independent hashing, handle heavy-tailed witnesses via normalization and clipping, and establish \(O(\log N)\) complexity for all major similarity methods from 1970 to 2024. We give explicit constructions for the Boolean, Natural, Real, Tropical, and Product monoids, prove tight concentration bounds, and demonstrate compositional properties that enable multi-primitive similarity systems.
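The formal construction is not reproduced in this excerpt; as a rough, hypothetical illustration of the claim that Bloom filters and Count-Min sketches arise from one mechanism, the Python sketch below projects item witnesses into hashed cells through a pluggable monoid operation. All names here (`MonoidSketch`, `op`, `combine`) are illustrative, and the per-row salts merely stand in for the 4-wise independent hash families the abstract assumes.

```python
import random

class MonoidSketch:
    """Illustrative sketch (not the paper's construction): a k-row table
    of m cells, each holding a monoid element.

    `op` is the monoid operation, `identity` its identity element.
    Each inserted item updates one hashed cell per row; a query combines
    the same k cells with `combine` (AND for Bloom, min for Count-Min).
    """
    def __init__(self, m, k, op, identity, combine, seed=0):
        self.m, self.k = m, k
        self.op, self.identity, self.combine = op, identity, combine
        rng = random.Random(seed)
        # Per-row salts as a stand-in for 4-wise independent hashing.
        self.salts = [rng.getrandbits(64) for _ in range(k)]
        self.cells = [[identity] * m for _ in range(k)]

    def _idx(self, row, x):
        return hash((self.salts[row], x)) % self.m

    def update(self, x, w=True):
        # Project the witness w into each row via the monoid operation.
        for r in range(self.k):
            i = self._idx(r, x)
            self.cells[r][i] = self.op(self.cells[r][i], w)

    def query(self, x):
        vals = [self.cells[r][self._idx(r, x)] for r in range(self.k)]
        out = vals[0]
        for v in vals[1:]:
            out = self.combine(out, v)
        return out

# Boolean monoid (OR) with AND-combine recovers a (partitioned) Bloom filter.
bloom = MonoidSketch(m=1024, k=4, op=lambda a, b: a or b,
                     identity=False, combine=lambda a, b: a and b)
bloom.update("cat")
print(bloom.query("cat"), bloom.query("dog"))  # True, (w.h.p.) False

# Natural monoid (+) with min-combine recovers a Count-Min sketch.
cms = MonoidSketch(m=1024, k=4, op=lambda a, b: a + b,
                   identity=0, combine=min)
for _ in range(3):
    cms.update("cat", 1)
print(cms.query("cat"))  # 3; overestimates only under hash collisions
```

Swapping in the Tropical monoid (min, +) or a Product monoid of the two would, under the same template, yield further instances of the kind the abstract enumerates.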