Spatial pyramid pooling module or encode-decoder structure are used in deep neural networks for semantic segmentation task. The former networks are able to encode multi-scale contextual information by probing the incoming features with filters or pooling operations at multiple rates and multiple effective fields-of-view, while the latter networks can capture sharper object boundaries by gradually recovering the spatial information. In this work, we propose to combine the advantages from both methods. Specifically, our proposed model, DeepLabv3+, extends DeepLabv3 by adding a simple yet effective decoder module to refine the segmentation results especially along object boundaries. We further explore the Xception model and apply the depthwise separable convolution to both Atrous Spatial Pyramid Pooling and decoder modules, resulting in a faster and stronger encoder-decoder network. We demonstrate the effectiveness of the proposed model on the PASCAL VOC 2012 semantic image segmentation dataset and achieve a performance of 89% on the test set without any post-processing. Our paper is accompanied with a publicly available reference implementation of the proposed models in Tensorflow.
In this project, we present ShelfNet, a lightweight convolutional neural network for accurate real-time semantic segmentation. Different from the standard encoder-decoder structure, ShelfNet has multiple encoder-decoder branch pairs with skip connections at each spatial level, which looks like a shelf with multiple columns. The shelf-shaped structure provides multiple paths for information flow and improves segmentation accuracy. Inspired by the success of recurrent convolutional neural networks, we use modified residual blocks where two convolutional layers share weights. The shared-weight block enables efficient feature extraction and model size reduction. We tested ShelfNet with ResNet50 and ResNet101 as the backbone respectively: they achieved 59 FPS and 42 FPS respectively on a GTX 1080Ti GPU with a 512x512 input image. ShelfNet achieved high accuracy: on PASCAL VOC 2012 test set, it achieved 84.2% mIoU with ResNet101 backbone and 82.8% mIoU with ResNet50 backbone; it achieved 75.8% mIoU with ResNet50 backbone on Cityscapes dataset. ShelfNet achieved both higher mIoU and faster inference speed compared with state-of-the-art real-time semantic segmentation models. We provide the implementation https://github.com/juntang-zhuang/ShelfNet.
3D image segmentation plays an important role in biomedical image analysis. Many 2D and 3D deep learning models have achieved state-of-the-art segmentation performance on 3D biomedical image datasets. Yet, 2D and 3D models have their own strengths and weaknesses, and by unifying them together, one may be able to achieve more accurate results. In this paper, we propose a new ensemble learning framework for 3D biomedical image segmentation that combines the merits of 2D and 3D models. First, we develop a fully convolutional network based meta-learner to learn how to improve the results from 2D and 3D models (base-learners). Then, to minimize over-fitting for our sophisticated meta-learner, we devise a new training method that uses the results of the base-learners as multiple versions of "ground truths". Furthermore, since our new meta-learner training scheme does not depend on manual annotation, it can utilize abundant unlabeled 3D image data to further improve the model. Extensive experiments on two public datasets (the HVSMR 2016 Challenge dataset and the mouse piriform cortex dataset) show that our approach is effective under fully-supervised, semi-supervised, and transductive settings, and attains superior performance over state-of-the-art image segmentation methods.
Real-time semantic segmentation plays an important role in practical applications such as self-driving and robots. Most research working on semantic segmentation focuses on accuracy with little consideration for efficiency. Several existing studies that emphasize high-speed inference often cannot produce high-accuracy segmentation results. In this paper, we propose a novel convolutional network named Efficient Dense modules with Asymmetric convolution (EDANet), which employs an asymmetric convolution structure incorporating the dilated convolution and the dense connectivity to attain high efficiency at low computational cost, inference time, and model size. Compared to FCN, EDANet is 11 times faster and has 196 times fewer parameters, while it achieves a higher the mean of intersection-over-union (mIoU) score without any additional decoder structure, context module, post-processing scheme, and pretrained model. We evaluate EDANet on Cityscapes and CamVid datasets to evaluate its performance and compare it with the other state-of-art systems. Our network can run on resolution 512x1024 inputs at the speed of 108 and 81 frames per second on a single GTX 1080Ti and Titan X, respectively.
We present a new method for synthesizing high-resolution photo-realistic images from semantic label maps using conditional generative adversarial networks (conditional GANs). Conditional GANs have enabled a variety of applications, but the results are often limited to low-resolution and still far from realistic. In this work, we generate 2048x1024 visually appealing results with a novel adversarial loss, as well as new multi-scale generator and discriminator architectures. Furthermore, we extend our framework to interactive visual manipulation with two additional features. First, we incorporate object instance segmentation information, which enables object manipulations such as removing/adding objects and changing the object category. Second, we propose a method to generate diverse results given the same input, allowing users to edit the object appearance interactively. Human opinion studies demonstrate that our method significantly outperforms existing methods, advancing both the quality and the resolution of deep image synthesis and editing.
In this work, we evaluate the use of superpixel pooling layers in deep network architectures for semantic segmentation. Superpixel pooling is a flexible and efficient replacement for other pooling strategies that incorporates spatial prior information. We propose a simple and efficient GPU-implementation of the layer and explore several designs for the integration of the layer into existing network architectures. We provide experimental results on the IBSR and Cityscapes dataset, demonstrating that superpixel pooling can be leveraged to consistently increase network accuracy with minimal computational overhead. Source code is available at https://github.com/bermanmaxim/superpixPool
For the challenging semantic image segmentation task the most efficient models have traditionally combined the structured modelling capabilities of Conditional Random Fields (CRFs) with the feature extraction power of CNNs. In more recent works however, CRF post-processing has fallen out of favour. We argue that this is mainly due to the slow training and inference speeds of CRFs, as well as the difficulty of learning the internal CRF parameters. To overcome both issues we propose to add the assumption of conditional independence to the framework of fully-connected CRFs. This allows us to reformulate the inference in terms of convolutions, which can be implemented highly efficiently on GPUs. Doing so speeds up inference and training by a factor of more then 100. All parameters of the convolutional CRFs can easily be optimized using backpropagation. To facilitating further CRF research we make our implementation publicly available. Please visit: https://github.com/MarvinTeichmann/ConvCRF
In this paper, we propose an efficient architecture for semantic image segmentation using the depth-to-space (D2S) operation. Our D2S model is comprised of a standard CNN encoder followed by a depth-to-space reordering of the final convolutional feature maps; thus eliminating the decoder portion of traditional encoder-decoder segmentation models and reducing computation time almost by half. As a participant of the DeepGlobe Road Extraction competition, we evaluate our models on the corresponding road segmentation dataset. Our highly efficient D2S models exhibit comparable performance to standard segmentation models with much less computational cost.
We propose a novel locally adaptive learning estimator for enhancing the inter- and intra- discriminative capabilities of Deep Neural Networks, which can be used as improved loss layer for semantic image segmentation tasks. Most loss layers compute pixel-wise cost between feature maps and ground truths, ignoring spatial layouts and interactions between neighboring pixels with same object category, and thus networks cannot be effectively sensitive to intra-class connections. Stride by stride, our method firstly conducts adaptive pooling filter operating over predicted feature maps, aiming to merge predicted distributions over a small group of neighboring pixels with same category, and then it computes cost between the merged distribution vector and their category label. Such design can make groups of neighboring predictions from same category involved into estimations on predicting correctness with respect to their category, and hence train networks to be more sensitive to regional connections between adjacent pixels based on their categories. In the experiments on Pascal VOC 2012 segmentation datasets, the consistently improved results show that our proposed approach achieves better segmentation masks against previous counterparts.
Image segmentation is considered to be one of the critical tasks in hyperspectral remote sensing image processing. Recently, convolutional neural network (CNN) has established itself as a powerful model in segmentation and classification by demonstrating excellent performances. The use of a graphical model such as a conditional random field (CRF) contributes further in capturing contextual information and thus improving the segmentation performance. In this paper, we propose a method to segment hyperspectral images by considering both spectral and spatial information via a combined framework consisting of CNN and CRF. We use multiple spectral cubes to learn deep features using CNN, and then formulate deep CRF with CNN-based unary and pairwise potential functions to effectively extract the semantic correlations between patches consisting of three-dimensional data cubes. Effective piecewise training is applied in order to avoid the computationally expensive iterative CRF inference. Furthermore, we introduce a deep deconvolution network that improves the segmentation masks. We also introduce a new dataset and experimented our proposed method on it along with several widely adopted benchmark datasets to evaluate the effectiveness of our method. By comparing our results with those from several state-of-the-art models, we show the promising potential of our method.
Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, exceed the state-of-the-art in semantic segmentation. Our key insight is to build "fully convolutional" networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet, the VGG net, and GoogLeNet) into fully convolutional networks and transfer their learned representations by fine-tuning to the segmentation task. We then define a novel architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves state-of-the-art segmentation of PASCAL VOC (20% relative improvement to 62.2% mean IU on 2012), NYUDv2, and SIFT Flow, while inference takes one third of a second for a typical image.