Two-branch network architectures have shown their efficiency and effectiveness in real-time semantic segmentation tasks. However, direct fusion of high-resolution details and low-frequency context has the drawback that detailed features are easily overwhelmed by the surrounding contextual information. This overshoot phenomenon limits the segmentation accuracy that existing two-branch models can reach. In this paper, we draw a connection between Convolutional Neural Networks (CNNs) and Proportional-Integral-Derivative (PID) controllers and reveal that a two-branch network is equivalent to a Proportional-Integral (PI) controller, which inherently suffers from similar overshoot issues. To alleviate this problem, we propose a novel three-branch network architecture, PIDNet, which contains three branches to parse detailed, context, and boundary information, respectively, and employs boundary attention to guide the fusion of the detailed and context branches. Our family of PIDNets achieves the best trade-off between inference speed and accuracy, and its accuracy surpasses that of all existing models with similar inference speed on the Cityscapes and CamVid datasets. Specifically, PIDNet-S achieves 78.6% mIOU at an inference speed of 93.2 FPS on Cityscapes and 80.1% mIOU at 153.7 FPS on CamVid.
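To make the idea of boundary-guided fusion concrete, the following is a minimal sketch and not the PIDNet implementation. It assumes that the boundary branch produces a per-pixel logit, that a sigmoid turns this logit into a gate, and that the gate blends the detailed and context features linearly. The function name `boundary_guided_fusion`, the flat-list data layout, and the gating formula are illustrative assumptions.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Logistic function mapping a logit to a gate value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def boundary_guided_fusion(
    detail: List[float],
    context: List[float],
    boundary_logits: List[float],
) -> List[float]:
    """Blend detail and context features per pixel (hypothetical sketch).

    Assumption: a high boundary logit means the pixel is likely on an
    object boundary, so the detailed feature is trusted more; elsewhere
    the context feature dominates.
    """
    if not (len(detail) == len(context) == len(boundary_logits)):
        raise ValueError("all inputs must have the same length")
    fused = []
    for d, c, b in zip(detail, context, boundary_logits):
        gate = sigmoid(b)  # boundary confidence in (0, 1)
        fused.append(gate * d + (1.0 - gate) * c)
    return fused


if __name__ == "__main__":
    detail = [1.0, 1.0, 1.0]
    context = [3.0, 3.0, 3.0]
    # Strongly non-boundary, undecided, strongly boundary.
    logits = [-50.0, 0.0, 50.0]
    print(boundary_guided_fusion(detail, context, logits))
```

Under these assumptions, a pixel far from any boundary takes almost all of its value from the context feature, while a pixel on a boundary keeps its detailed feature, which is one way to stop context from overwhelming fine detail.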