This competition focuses on urban-scene segmentation from the vehicle camera view. The highly class-imbalanced urban-scene image dataset challenges existing solutions and motivates further study. Deep convolutional neural network-based semantic segmentation methods, such as encoder-decoder architectures and multi-scale and pyramid-based approaches, have become flexible solutions applicable to real-world applications. In this competition, we mainly review the literature on and conduct experiments with transformer-driven methods, especially SegFormer, to achieve an optimal trade-off between performance and efficiency. For example, SegFormer-B0 achieved 74.6% mIoU with the smallest FLOPs (15.6G), and the largest model, SegFormer-B5, achieved 80.2% mIoU. Based on multiple factors, including individual failure-case analysis, per-class performance, training cost, and efficiency estimation, our final candidate model for the competition is SegFormer-B2, with 50.6 GFLOPs and 78.5% mIoU evaluated on the test set. Check out our code implementation at https://vmv.re/cv3315.
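For reference, the mIoU figures quoted above denote the class-averaged intersection-over-union between predicted and ground-truth label maps. Below is a minimal sketch of this metric, assuming integer label maps and the conventional ignore index 255 (the function name and signature are illustrative, not taken from our released code):

```python
import numpy as np

def mean_iou(pred: np.ndarray, target: np.ndarray,
             num_classes: int, ignore_index: int = 255) -> float:
    """Class-averaged intersection-over-union for semantic segmentation.

    pred, target: integer label maps of identical shape.
    Pixels labeled `ignore_index` in the target are excluded from scoring.
    """
    valid = target != ignore_index
    pred, target = pred[valid], target[valid]
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return float(np.mean(ious))
```

Because the metric averages over classes rather than pixels, rare classes weigh as much as frequent ones, which is why class imbalance in the dataset directly pressures the reported scores.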