Large pre-trained transformers top contemporary semantic segmentation benchmarks, but they come with high computational cost and lengthy training. To lift this constraint, we look at efficient semantic segmentation from the perspective of comprehensive knowledge distillation and aim to bridge the gap between multi-source knowledge extraction and transformer-specific patch embeddings. We put forward the Transformer-based Knowledge Distillation (TransKD) framework, which learns compact student transformers by distilling both the feature maps and the patch embeddings of large teacher transformers, bypassing the long pre-training process and reducing FLOPs by >85.0%. Specifically, we propose two fundamental and two optimization modules: (1) Cross Selective Fusion (CSF) enables knowledge transfer between cross-stage features via channel attention and feature-map distillation within hierarchical transformers; (2) Patch Embedding Alignment (PEA) performs a dimensional transformation within the patchifying process to facilitate patch embedding distillation; (3) Global-Local Context Mixer (GL-Mixer) extracts both global and local information from a representative embedding; (4) Embedding Assistant (EA) acts as an embedding method that seamlessly bridges teacher and student models with the teacher's number of channels. Experiments on the Cityscapes, ACDC, and NYUv2 datasets show that TransKD outperforms state-of-the-art distillation frameworks and rivals the time-consuming pre-training method. Code is available at https://github.com/RuipingL/TransKD.
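To make the patch embedding distillation idea concrete, below is a minimal PyTorch sketch of the alignment step described for PEA: projecting the student's patch embeddings to the teacher's channel width and penalizing the remaining gap. The class name, the use of a single linear projection, the MSE criterion, and the tensor shapes are illustrative assumptions; the authoritative implementation is the one in the linked TransKD repository.

```python
import torch
import torch.nn as nn


class PatchEmbeddingAlignment(nn.Module):
    """Sketch of patch-embedding distillation: map student patch tokens to the
    teacher's channel dimension, then penalize the distance to the teacher tokens."""

    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        # Hypothetical dimensional transformation; the released TransKD code may
        # realize this differently inside the patchifying process.
        self.proj = nn.Linear(student_dim, teacher_dim)
        self.criterion = nn.MSELoss()

    def forward(self, student_emb: torch.Tensor, teacher_emb: torch.Tensor) -> torch.Tensor:
        # student_emb: (B, N, C_s) patch tokens from a student stage
        # teacher_emb: (B, N, C_t) patch tokens from the matching teacher stage
        return self.criterion(self.proj(student_emb), teacher_emb)


# Usage with illustrative shapes (batch of 2, 1024 patches, C_s=64, C_t=128):
pea = PatchEmbeddingAlignment(student_dim=64, teacher_dim=128)
loss = pea(torch.randn(2, 1024, 64), torch.randn(2, 1024, 128))
```

In a full distillation setup, one such alignment loss per transformer stage would be added to the segmentation loss and to the feature-map distillation terms; the weighting between these terms is a training hyperparameter not specified here.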