Distribution shift is the defining challenge of real-world machine learning. The dominant paradigm, Unsupervised Domain Adaptation (UDA), enforces feature invariance by aligning source and target representations through symmetric divergence minimization [Ganin et al., 2016]. We demonstrate that this approach is fundamentally flawed: when domains are unequally informative (e.g., high-quality vs. degraded sensors), strict invariance necessitates information destruction, causing "negative transfer" that can be catastrophic in safety-critical applications [Wang et al., 2019]. We propose a decision-theoretic framework grounded in Le Cam's theory of statistical experiments [Le Cam, 1986], using constructive approximations to replace symmetric invariance with directional simulability. We introduce Le Cam Distortion, quantified by the Deficiency Distance $\delta(E_1, E_2)$, as a rigorous upper bound on transfer risk conditional on simulability. Our framework enables transfer without source degradation by learning a kernel that simulates the target from the source. Across five experiments (genomics, vision, reinforcement learning), Le Cam Distortion achieves: (1) near-perfect frequency estimation in HLA genomics (correlation $r=0.999$, matching classical methods), (2) zero source utility loss in CIFAR-10 image classification (81.2% accuracy preserved, versus a 34.7% drop for CycleGAN), and (3) safe policy transfer in RL control, where invariance-based methods suffer catastrophic collapse. Le Cam Distortion provides the first principled framework for risk-controlled transfer learning in domains where negative transfer is unacceptable: medical imaging, autonomous systems, and precision medicine.
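For readers unfamiliar with the deficiency distance, the following is a sketch of the standard definition from Le Cam's theory, not necessarily the exact form used in this work. The notation is assumed for illustration: $E_1 = \{P_\theta\}$ is taken to be the source experiment on a sample space $\mathcal{X}$, $E_2 = \{Q_\theta\}$ the target experiment on $\mathcal{Y}$, $K$ ranges over Markov kernels from $\mathcal{X}$ to $\mathcal{Y}$, and $\|\cdot\|_{\mathrm{TV}}$ is total variation distance normalized to lie in $[0,1]$.

$$
\delta(E_1, E_2) \;=\; \inf_{K} \, \sup_{\theta} \, \bigl\| K P_\theta - Q_\theta \bigr\|_{\mathrm{TV}}
$$

Under the additional assumption that the loss $L$ takes values in $[0,1]$, any target decision rule $\rho$ can be composed with a kernel $K$ to give a source rule $\rho \circ K$ whose risk satisfies

$$
R_{E_1}(\theta, \rho \circ K) \;\le\; R_{E_2}(\theta, \rho) + \bigl\| K P_\theta - Q_\theta \bigr\|_{\mathrm{TV}} \quad \text{for every } \theta .
$$

The quantity is directional: $\delta(E_1, E_2)$ can be small while $\delta(E_2, E_1)$ is large, which is the asymmetry that the contrast with symmetric invariance relies on.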