Semantic segmentation serves as the backbone of many vision systems, from self-driving cars and robot navigation to augmented reality and teleconferencing. Because these systems frequently operate under stringent latency constraints within a limited resource envelope, optimising for efficient execution becomes important. To this end, we propose a framework for converting state-of-the-art segmentation models to MESS networks: specially trained CNNs that employ parametrised early exits along their depth to save computation during inference on easier samples. Designing and training such networks naively can hurt performance. Thus, we propose a two-staged training process that pushes semantically important features early in the network. We co-optimise the number, placement and architecture of the attached segmentation heads, along with the exit policy, to adapt to the device capabilities and application-specific requirements. When optimising for speed, MESS networks can achieve latency gains of up to 2.83x over state-of-the-art methods with no accuracy degradation. Conversely, when optimising for accuracy, we achieve an improvement of up to 5.33 pp under the same computational budget.
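To make the idea of early exits with an exit policy concrete, the following is a minimal sketch of early-exit inference. It is not the authors' implementation: the abstract does not specify the exit policy, so the rule used here (exit once a sufficient fraction of pixels has a top-class softmax probability above a threshold) is an assumption, as are all names (`early_exit_inference`, `confident_fraction`, `pixel_threshold`, `image_threshold`) and the toy representation of logits as one list of class scores per pixel.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

# Toy representation (an assumption): one list of class scores per pixel.
Logits = List[List[float]]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one pixel's class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def confident_fraction(logits: Logits, pixel_threshold: float) -> float:
    """Fraction of pixels whose top-class probability reaches pixel_threshold."""
    confident = sum(1 for px in logits if max(softmax(px)) >= pixel_threshold)
    return confident / len(logits)


def early_exit_inference(
    stages: Sequence[Callable],
    heads: Sequence[Optional[Callable]],
    x,
    pixel_threshold: float = 0.9,   # hypothetical per-pixel confidence threshold
    image_threshold: float = 0.8,   # hypothetical fraction of confident pixels
) -> Tuple[List[int], int]:
    """Run backbone stages in order and stop at the first confident exit.

    heads[i] is the segmentation head attached after stages[i], or None if
    that stage has no exit. Returns (per-pixel labels, index of exit used).
    """
    features = x
    last = len(stages) - 1
    for i, (stage, head) in enumerate(zip(stages, heads)):
        features = stage(features)
        if head is None:
            continue
        logits = head(features)
        # The final head always produces the output; earlier heads only do so
        # when the assumed confidence rule is satisfied.
        if i == last or confident_fraction(logits, pixel_threshold) >= image_threshold:
            labels = [max(range(len(px)), key=px.__getitem__) for px in logits]
            return labels, i
    raise ValueError("the final stage must have a segmentation head")
```

With toy stages and heads, an input whose early-exit logits are already decisive leaves at the first head and skips the remaining stages, while an ambiguous input continues to the final head. The two thresholds trade accuracy for latency and would need to be tuned per device and application.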