Results of image stitching can be perceptually divided into single-perspective and multiple-perspective. Compared to the multiple-perspective result, the single-perspective result excels in perspective consistency but suffers from projective distortion. In this paper, we propose two single-perspective warps for natural image stitching. The first one is a parametric warp, which is a combination of the as-projective-as-possible warp and the quasi-homography warp via dual-feature. The second one is a mesh-based warp, which is determined by optimizing a total energy function that simultaneously emphasizes different characteristics of the single-perspective warp, including alignment, naturalness, distortion and saliency. A comprehensive evaluation demonstrates that the proposed warp outperforms some state-of-the-art warps, including homography, APAP, AutoStitch, SPHP and GSP.

4
下载
关闭预览

相关内容

图像拼接(image stitching)是指将两张或更多的有重叠部分的影像,拼接成一张全景图或是高分辨率影像的技术。图像拼接有两大步骤:图像配准和图像融合

We consider the problem of referring image segmentation. Given an input image and a natural language expression, the goal is to segment the object referred by the language expression in the image. Existing works in this area treat the language expression and the input image separately in their representations. They do not sufficiently capture long-range correlations between these two modalities. In this paper, we propose a cross-modal self-attention (CMSA) module that effectively captures the long-range dependencies between linguistic and visual features. Our model can adaptively focus on informative words in the referring expression and important regions in the input image. In addition, we propose a gated multi-level fusion module to selectively integrate self-attentive cross-modal features corresponding to different levels in the image. This module controls the information flow of features at different levels. We validate the proposed approach on four evaluation datasets. Our proposed approach consistently outperforms existing state-of-the-art methods.

0
16
下载
预览

Most deep learning-based models for speech enhancement have mainly focused on estimating the magnitude of spectrogram while reusing the phase from noisy speech for reconstruction. This is due to the difficulty of estimating the phase of clean speech. To improve speech enhancement performance, we tackle the phase estimation problem in three ways. First, we propose Deep Complex U-Net, an advanced U-Net structured model incorporating well-defined complex-valued building blocks to deal with complex-valued spectrograms. Second, we propose a polar coordinate-wise complex-valued masking method to reflect the distribution of complex ideal ratio masks. Third, we define a novel loss function, weighted source-to-distortion ratio (wSDR) loss, which is designed to directly correlate with a quantitative evaluation measure. Our model was evaluated on a mixture of the Voice Bank corpus and DEMAND database, which has been widely used by many deep learning models for speech enhancement. Ablation experiments were conducted on the mixed dataset showing that all three proposed approaches are empirically valid. Experimental results show that the proposed method achieves state-of-the-art performance in all metrics, outperforming previous approaches by a large margin.

0
15
下载
预览

Attention mechanisms are a design trend of deep neural networks that stands out in various computer vision tasks. Recently, some works have attempted to apply attention mechanisms to single image super-resolution (SR) tasks. However, they apply the mechanisms to SR in the same or similar ways used for high-level computer vision problems without much consideration of the different nature between SR and other problems. In this paper, we propose a new attention method, which is composed of new channel-wise and spatial attention mechanisms optimized for SR and a new fused attention to combine them. Based on this, we propose a new residual attention module (RAM) and a SR network using RAM (SRRAM). We provide in-depth experimental analysis of different attention mechanisms in SR. It is shown that the proposed method can construct both deep and lightweight SR networks showing improved performance in comparison to existing state-of-the-art methods.

0
6
下载
预览

We study the use of the Wave-U-Net architecture for speech enhancement, a model introduced by Stoller et al for the separation of music vocals and accompaniment. This end-to-end learning method for audio source separation operates directly in the time domain, permitting the integrated modelling of phase information and being able to take large temporal contexts into account. Our experiments show that the proposed method improves several metrics, namely PESQ, CSIG, CBAK, COVL and SSNR, over the state-of-the-art with respect to the speech enhancement task on the Voice Bank corpus (VCTK) dataset. We find that a reduced number of hidden layers is sufficient for speech enhancement in comparison to the original system designed for singing voice separation in music. We see this initial result as an encouraging signal to further explore speech enhancement in the time-domain, both as an end in itself and as a pre-processing step to speech recognition systems.

0
8
下载
预览

Current image captioning approaches generate descriptions which lack specific information, such as named entities that are involved in the images. In this paper we propose a new task which aims to generate informative image captions, given images and hashtags as input. We propose a simple but effective approach to tackle this problem. We first train a convolutional neural networks - long short term memory networks (CNN-LSTM) model to generate a template caption based on the input image. Then we use a knowledge graph based collective inference algorithm to fill in the template with specific named entities retrieved via the hashtags. Experiments on a new benchmark dataset collected from Flickr show that our model generates news-style image descriptions with much richer information. Our model outperforms unimodal baselines significantly with various evaluation metrics.

0
4
下载
预览

Seam-cutting and seam-driven techniques have been proven effective for handling imperfect image series in image stitching. Generally, seam-driven is to utilize seam-cutting to find a best seam from one or finite alignment hypotheses based on a predefined seam quality metric. However, the quality metrics in most methods are defined to measure the average performance of the pixels on the seam without considering the relevance and variance among them. This may cause that the seam with the minimal measure is not optimal (perception-inconsistent) in human perception. In this paper, we propose a novel coarse-to-fine seam estimation method which applies the evaluation in a different way. For pixels on the seam, we develop a patch-point evaluation algorithm concentrating more on the correlation and variation of them. The evaluations are then used to recalculate the difference map of the overlapping region and reestimate a stitching seam. This evaluation-reestimation procedure iterates until the current seam changes negligibly comparing with the previous seams. Experiments show that our proposed method can finally find a nearly perception-consistent seam after several iterations, which outperforms the conventional seam-cutting and other seam-driven methods.

0
4
下载
预览

Answering visual questions need acquire daily common knowledge and model the semantic connection among different parts in images, which is too difficult for VQA systems to learn from images with the only supervision from answers. Meanwhile, image captioning systems with beam search strategy tend to generate similar captions and fail to diversely describe images. To address the aforementioned issues, we present a system to have these two tasks compensate with each other, which is capable of jointly producing image captions and answering visual questions. In particular, we utilize question and image features to generate question-related captions and use the generated captions as additional features to provide new knowledge to the VQA system. For image captioning, our system attains more informative results in term of the relative improvements on VQA tasks as well as competitive results using automated metrics. Applying our system to the VQA tasks, our results on VQA v2 dataset achieve 65.8% using generated captions and 69.1% using annotated captions in validation set and 68.4% in the test-standard set. Further, an ensemble of 10 models results in 69.7% in the test-standard split.

0
6
下载
预览

Image manipulation detection is different from traditional semantic object detection because it pays more attention to tampering artifacts than to image content, which suggests that richer features need to be learned. We propose a two-stream Faster R-CNN network and train it endto- end to detect the tampered regions given a manipulated image. One of the two streams is an RGB stream whose purpose is to extract features from the RGB image input to find tampering artifacts like strong contrast difference, unnatural tampered boundaries, and so on. The other is a noise stream that leverages the noise features extracted from a steganalysis rich model filter layer to discover the noise inconsistency between authentic and tampered regions. We then fuse features from the two streams through a bilinear pooling layer to further incorporate spatial co-occurrence of these two modalities. Experiments on four standard image manipulation datasets demonstrate that our two-stream framework outperforms each individual stream, and also achieves state-of-the-art performance compared to alternative methods with robustness to resizing and compression.

0
8
下载
预览

Person Re-Identification (ReID) requires comparing two images of person captured under different conditions. Existing work based on neural networks often computes the similarity of feature maps from one single convolutional layer. In this work, we propose an efficient, end-to-end fully convolutional Siamese network that computes the similarities at multiple levels. We demonstrate that multi-level similarity can improve the accuracy considerably using low-complexity network structures in ReID problem. Specifically, first, we use several convolutional layers to extract the features of two input images. Then, we propose Convolution Similarity Network to compute the similarity score maps for the inputs. We use spatial transformer networks (STNs) to determine spatial attention. We propose to apply efficient depth-wise convolution to compute the similarity. The proposed Convolution Similarity Networks can be inserted into different convolutional layers to extract visual similarities at different levels. Furthermore, we use an improved ranking loss to further improve the performance. Our work is the first to propose to compute visual similarities at low, middle and high levels for ReID. With extensive experiments and analysis, we demonstrate that our system, compact yet effective, can achieve competitive results with much smaller model size and computational complexity.

0
4
下载
预览

Image segmentation is considered to be one of the critical tasks in hyperspectral remote sensing image processing. Recently, convolutional neural network (CNN) has established itself as a powerful model in segmentation and classification by demonstrating excellent performances. The use of a graphical model such as a conditional random field (CRF) contributes further in capturing contextual information and thus improving the segmentation performance. In this paper, we propose a method to segment hyperspectral images by considering both spectral and spatial information via a combined framework consisting of CNN and CRF. We use multiple spectral cubes to learn deep features using CNN, and then formulate deep CRF with CNN-based unary and pairwise potential functions to effectively extract the semantic correlations between patches consisting of three-dimensional data cubes. Effective piecewise training is applied in order to avoid the computationally expensive iterative CRF inference. Furthermore, we introduce a deep deconvolution network that improves the segmentation masks. We also introduce a new dataset and experimented our proposed method on it along with several widely adopted benchmark datasets to evaluate the effectiveness of our method. By comparing our results with those from several state-of-the-art models, we show the promising potential of our method.

0
10
下载
预览
小贴士
相关论文
Cross-Modal Self-Attention Network for Referring Image Segmentation
Linwei Ye,Mrigank Rochan,Zhi Liu,Yang Wang
16+阅读 · 2019年4月9日
Phase-aware Speech Enhancement with Deep Complex U-Net
Hyeong-Seok Choi,Jang-Hyun Kim,Jaesung Huh,Adrian Kim,Jung-Woo Ha,Kyogu Lee
15+阅读 · 2019年3月7日
RAM: Residual Attention Module for Single Image Super-Resolution
Jun-Hyuk Kim,Jun-Ho Choi,Manri Cheon,Jong-Seok Lee
6+阅读 · 2018年11月29日
Craig Macartney,Tillman Weyde
8+阅读 · 2018年11月27日
Di Lu,Spencer Whitehead,Lifu Huang,Heng Ji,Shih-Fu Chang
4+阅读 · 2018年11月7日
Tianli Liao,Jing Chen,Yifang Xu
4+阅读 · 2018年5月24日
Jialin Wu,Zeyuan Hu,Raymond J. Mooney
6+阅读 · 2018年5月22日
Peng Zhou,Xintong Han,Vlad I. Morariu,Larry S. Davis
8+阅读 · 2018年5月13日
Yiluan Guo,Ngai-Man Cheung
4+阅读 · 2018年4月2日
Fahim Irfan Alam,Jun Zhou,Alan Wee-Chung Liew,Xiuping Jia,Jocelyn Chanussot,Yongsheng Gao
10+阅读 · 2017年12月27日
相关资讯
Hierarchically Structured Meta-learning
CreateAMind
13+阅读 · 2019年5月22日
Call for Participation: Shared Tasks in NLPCC 2019
中国计算机学会
5+阅读 · 2019年3月22日
强化学习的Unsupervised Meta-Learning
CreateAMind
7+阅读 · 2019年1月7日
A Technical Overview of AI & ML in 2018 & Trends for 2019
待字闺中
10+阅读 · 2018年12月24日
Reinforcement Learning: An Introduction 2018第二版 500页
CreateAMind
9+阅读 · 2018年4月27日
条件GAN重大改进!cGANs with Projection Discriminator
CreateAMind
6+阅读 · 2018年2月7日
ResNet, AlexNet, VGG, Inception:各种卷积网络架构的理解
全球人工智能
15+阅读 · 2017年12月17日
【推荐】ResNet, AlexNet, VGG, Inception:各种卷积网络架构的理解
机器学习研究会
17+阅读 · 2017年12月17日
【推荐】自然语言处理(NLP)指南
机器学习研究会
33+阅读 · 2017年11月17日
Auto-Encoding GAN
CreateAMind
5+阅读 · 2017年8月4日
Top
微信扫码咨询专知VIP会员