Modern machine learning embeddings provide powerful compression of high-dimensional data, yet they typically destroy the geometric structure required for classical likelihood-based statistical inference. This paper develops a rigorous theory of likelihood-preserving embeddings: learned representations that can replace raw data in likelihood-based workflows -- hypothesis testing, confidence interval construction, model selection -- without altering inferential conclusions. We introduce the Likelihood-Ratio Distortion metric $\Delta_n$, which measures the maximum error in log-likelihood ratios induced by an embedding. Our main theoretical contribution is the Hinge Theorem, which establishes that controlling $\Delta_n$ is necessary and sufficient for preserving inference. Specifically, if the distortion satisfies $\Delta_n = o_p(1)$, then (i) all likelihood-ratio-based tests and Bayes factors are asymptotically preserved, and (ii) surrogate maximum likelihood estimators are asymptotically equivalent to full-data MLEs. We prove an impossibility result showing that universal likelihood preservation requires essentially invertible embeddings, motivating the need for model-class-specific guarantees. We then provide a constructive framework using neural networks as approximate sufficient statistics, deriving explicit bounds connecting training loss to inferential guarantees. Experiments on Gaussian and Cauchy distributions validate the sharp phase transition predicted by exponential family theory, and applications to distributed clinical inference demonstrate practical utility.
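The abstract's distortion metric can be made concrete with a toy computation. The sketch below assumes a plausible reading of $\Delta_n$ from the abstract (the paper's formal definition may differ): the maximum, over a grid of alternatives $\theta$, of the gap between the log-likelihood ratio computed from the raw data and the one computed from the embedding. For a Gaussian location family $N(\theta, 1)$, the sample mean is a sufficient statistic, so an embedding that retains it incurs essentially zero distortion, while a lossy embedding (here, a coarsely rounded mean, chosen purely for illustration) does not.

```python
import random

def loglik_ratio_from_data(x, th1, th0):
    # Gaussian N(theta, 1): log LR = sum_i [log f(x_i; th1) - log f(x_i; th0)]
    return sum(-(xi - th1) ** 2 / 2 + (xi - th0) ** 2 / 2 for xi in x)

def loglik_ratio_from_mean(xbar, n, th1, th0):
    # Same log LR expressed through the sufficient statistic xbar:
    # n*(th1 - th0)*xbar + n*(th0^2 - th1^2)/2
    return n * (th1 - th0) * xbar + n * (th0 ** 2 - th1 ** 2) / 2

random.seed(0)
n = 50
x = [random.gauss(0.3, 1.0) for _ in range(n)]
xbar = sum(x) / n

# Distortion over a grid of alternatives, null fixed at theta_0 = 0.
grid = [-1.0 + 0.1 * i for i in range(21)]

# Sufficient embedding T(x) = xbar: distortion vanishes (up to float error).
delta_suff = max(
    abs(loglik_ratio_from_data(x, t, 0.0) - loglik_ratio_from_mean(xbar, n, t, 0.0))
    for t in grid
)

# Lossy embedding: keep only the mean rounded to one decimal place.
xbar_lossy = round(xbar, 1)
delta_lossy = max(
    abs(loglik_ratio_from_data(x, t, 0.0) - loglik_ratio_from_mean(xbar_lossy, n, t, 0.0))
    for t in grid
)

print(delta_suff, delta_lossy)
```

The sufficient embedding reproduces every log-likelihood ratio exactly, which is the exponential-family side of the phase transition the abstract mentions; the Cauchy family, having no fixed-dimensional sufficient statistic, sits on the other side.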