Knowledge Distillation (KD) has emerged as a promising technique for model compression, but it faces critical limitations: (1) sensitivity to hyperparameters, which requires extensive manual tuning; (2) a capacity gap when distilling from very large teachers to small students; (3) suboptimal coordination in multi-teacher scenarios; and (4) inefficient use of computational resources. We present \textbf{HPM-KD}, a framework that integrates six synergistic components: (i) an Adaptive Configuration Manager that uses meta-learning to eliminate manual hyperparameter tuning, (ii) a Progressive Distillation Chain with automatically determined intermediate models, (iii) an Attention-Weighted Multi-Teacher Ensemble that learns dynamic per-sample weights, (iv) a Meta-Learned Temperature Scheduler that adapts the distillation temperature throughout training, (v) a Parallel Processing Pipeline with intelligent load balancing, and (vi) a Shared Optimization Memory for cross-experiment reuse. Experiments on CIFAR-10, CIFAR-100, and tabular datasets demonstrate that HPM-KD achieves 10x-15x compression while retaining 85% of teacher accuracy, eliminates manual hyperparameter tuning, and reduces training time by 30-40% through parallelization. Ablation studies confirm that each component contributes independently (0.10-0.98 percentage points). HPM-KD is available as part of the open-source DeepBridge library.
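For intuition, the sketch below illustrates the attention-weighted multi-teacher ensemble described above, with a single learnable temperature standing in for the meta-learned scheduler. It is a minimal assumption-laden example, not the DeepBridge API; all class, function, and variable names are illustrative.

\begin{verbatim}
# Illustrative sketch (not the DeepBridge API): attention-weighted
# multi-teacher distillation with a learnable temperature.
import torch
import torch.nn.functional as F

class AttentionWeightedKD(torch.nn.Module):
    def __init__(self, num_teachers, feat_dim, init_temperature=4.0):
        super().__init__()
        # Per-sample attention over teachers, computed from student features.
        self.attn = torch.nn.Linear(feat_dim, num_teachers)
        # Learnable log-temperature; HPM-KD meta-learns a schedule instead.
        self.log_t = torch.nn.Parameter(torch.log(torch.tensor(init_temperature)))

    def forward(self, student_logits, student_feats, teacher_logits_list):
        t = self.log_t.exp()
        # Dynamic per-sample weights over teachers: (batch, num_teachers)
        weights = F.softmax(self.attn(student_feats), dim=-1)
        # Stack teacher logits: (batch, num_teachers, num_classes)
        teachers = torch.stack(teacher_logits_list, dim=1)
        # Attention-weighted soft targets at temperature t
        soft_targets = torch.einsum("bk,bkc->bc",
                                    weights, F.softmax(teachers / t, dim=-1))
        log_student = F.log_softmax(student_logits / t, dim=-1)
        # Standard KD scaling by t^2 keeps gradient magnitudes comparable.
        return F.kl_div(log_student, soft_targets,
                        reduction="batchmean") * t ** 2
\end{verbatim}

In this reading, the attention module replaces fixed ensemble weights with per-sample teacher weights learned jointly with the student, which is how we interpret the "dynamic per-sample weights" of component (iii).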