Video denoising aims to recover high-quality frames from noisy videos. Most existing approaches adopt convolutional neural networks~(CNNs) to separate the noise from the original visual content; however, CNNs focus on local information and ignore interactions between long-range regions within a frame. Furthermore, most related works directly take the output of basic spatio-temporal denoising as the final result, neglecting the fine-grained denoising process. In this paper, we propose a Dual-stage Spatial-Channel Transformer~(DSCT) for coarse-to-fine video denoising, which inherits the advantages of both Transformers and CNNs. Specifically, DSCT adopts a progressive dual-stage architecture, namely a coarse-level stage and a fine-level stage, to extract dynamic features and static features, respectively. At both stages, a Spatial-Channel Encoding Module is designed to model long-range contextual dependencies at both the spatial and channel levels. Meanwhile, we design a Multi-Scale Residual Structure to preserve multiple aspects of information across stages, which contains a Temporal Features Aggregation Module to summarize the dynamic representation. Extensive experiments on four publicly available datasets demonstrate that our proposed method achieves significant improvements over state-of-the-art methods.
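The abstract's core idea of modeling long-range dependencies at both the spatial and channel levels can be illustrated with self-attention applied along each axis of a feature map. The sketch below is a minimal, hypothetical illustration in NumPy, not the paper's actual module: the function names (`spatial_attention`, `channel_attention`, `spatial_channel_block`), the residual composition, and the scaling factors are all assumptions for exposition, since the abstract does not specify the internal design.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_attention(feat):
    """Self-attention across spatial positions: each of the H*W
    positions attends to every other position, capturing the
    long-range spatial interactions that plain convolutions miss."""
    C, H, W = feat.shape
    tokens = feat.reshape(C, H * W).T                 # (HW, C): one token per position
    attn = softmax(tokens @ tokens.T / np.sqrt(C))    # (HW, HW) position affinities
    return (attn @ tokens).T.reshape(C, H, W)

def channel_attention(feat):
    """Self-attention across channels: each channel map attends to
    every other channel, modeling channel-level dependencies."""
    C, H, W = feat.shape
    tokens = feat.reshape(C, H * W)                       # (C, HW): one token per channel
    attn = softmax(tokens @ tokens.T / np.sqrt(H * W))    # (C, C) channel affinities
    return (attn @ tokens).reshape(C, H, W)

def spatial_channel_block(feat):
    """Hypothetical composition: spatial then channel attention,
    each with a residual connection (an assumed design choice)."""
    feat = feat + spatial_attention(feat)
    return feat + channel_attention(feat)

if __name__ == "__main__":
    feat = np.random.randn(8, 16, 16)     # toy (C, H, W) feature map
    out = spatial_channel_block(feat)
    print(out.shape)                      # (8, 16, 16): shape is preserved
```

In this toy form, queries, keys, and values share one projection-free token matrix; a real Transformer encoder would learn separate linear projections and multiple heads, but the axis along which attention runs (positions vs. channels) is the point being illustrated.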