Tasks that involve high-resolution dense prediction require modeling both local and global patterns over a large input field. Although local and global structures often depend on each other, making their simultaneous modeling important, many convolutional neural network (CNN)-based approaches exchange representations across different resolutions only a few times. In this paper, we argue for the importance of dense, simultaneous modeling of multiresolution representations and propose a novel CNN architecture called densely connected multidilated DenseNet (D3Net). D3Net introduces a multidilated convolution that applies different dilation factors within a single layer to model different resolutions simultaneously. By combining the multidilated convolution with the DenseNet architecture, D3Net incorporates multiresolution learning with an exponentially growing receptive field in almost all layers, while avoiding the aliasing problem that arises when dilated convolution is naively incorporated into DenseNet. Experiments on image semantic segmentation using Cityscapes and audio source separation using MUSDB18 show that the proposed method outperforms state-of-the-art methods.
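To make the core idea concrete, the following is a minimal 1-D NumPy sketch of a multidilated convolution: each input group is convolved with its own dilation factor (here exponentially growing as 2^i) and the results are summed, so a single layer sees the input at several resolutions at once. This is an illustration under simplifying assumptions, not the paper's implementation: D3Net operates on 2-D feature maps with learned weights inside DenseNet blocks, and the function names here are hypothetical.

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """Valid 1-D cross-correlation of signal x with kernel w whose taps
    are spaced `dilation` samples apart (a dilated convolution)."""
    k = len(w)
    span = (k - 1) * dilation  # receptive field of the dilated kernel
    return np.array([np.dot(x[t:t + span + 1:dilation], w)
                     for t in range(len(x) - span)])

def multidilated_conv1d(groups, kernels):
    """Sketch of a multidilated convolution: the i-th input group is
    convolved with dilation 2**i, and outputs are cropped to a common
    length and summed, so one layer mixes several resolutions."""
    outs = [dilated_conv1d(g, w, 2 ** i)
            for i, (g, w) in enumerate(zip(groups, kernels))]
    n = min(len(o) for o in outs)  # align lengths before summing
    return sum(o[:n] for o in outs)

# Example: two identical groups, box kernels; group 0 uses dilation 1,
# group 1 uses dilation 2, giving a larger receptive field per layer.
x = np.arange(8.0)
out = multidilated_conv1d([x, x], [np.ones(3), np.ones(3)])
```

In the actual architecture, the group receiving dilation 2^i is the skip-connected output of an earlier layer whose receptive field is already large enough, which is what avoids the aliasing that a single large dilation applied to raw features would cause.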